Resource watch re-subscription uses wrong resource version and floods the k8s API server #5997
Comments
Connects to internal reference SURE-3208, and potentially others (SURE-4484).
Whoever takes this on, @cmurphy is happy to help.
As noted in the linked PR (#6024), we have a fix for this via Harvester, but it needs to be reviewed first.
@rak-phillip to look this over and sync with @cmurphy to see if the existing PR helps out here or if other work is required. Then we'd need to test.
Can we also get an estimate of when this would have started happening? Has it existed since the beginning of dashboard, or was it introduced at some point in time?
I spoke with @cmurphy offline and I think we should merge rancher/steve#49 because it looks like it fixes the issue that we're experiencing today. We can still make some changes to Dashboard to improve on the behavior described above, but the change to Steve should resolve this issue:
@cmurphy Was able to reproduce after creating/deleting around 30 deployments and waiting roughly 45 minutes. Observed websockets with sockId=1 and sockId=3 starting and stopping various watches (websocket1_websocket3_startstop.mp4). Observed cattle-agent debug logs confirming that watchers were being opened and closed repeatedly. Observed a very high rate of WATCH requests on the Kubernetes API server.
I tried to repro this recently and couldn't - I'm assuming that's because Colleen's PR was merged. I forced an incorrect resource version to be used - a random number - and confirmed that the socket has self-corrected and is still able to watch the resource. Or at least I THINK that is what this screenshot is showing:

@rak-phillip That's a great doc that you linked. In response to your comment:
This part of the linked doc makes me think the dashboard shouldn't be comparing resource versions with Math.max() at all:
I think there should probably be a new tech debt issue opened about making some minimum enhancements. I agree that we should accept the suggestion quoted above.

Side note on tools

I was also curious about this part of those upstream docs that mention tooling for watching K8s resources:

I wonder why we have an in-house implementation of findAll, or if there is a better tool that can be used with Vue. I found this example of watching a K8s collection with a Javascript client, and it looks simpler than what we have: https://github.com/kubernetes-client/javascript/blob/master/examples/cache-example.js#L13 If we used that client, it could help us differentiate between lists and individual resources.

Side note on performance

Before reading the linked docs, I thought a big reason the UI takes a long time to load was that etcd doesn't support pagination. But in fact, the K8s docs say the K8s API server does support pagination to make UIs faster and reduce server load:
Since the K8s API supports pagination, there should probably be another performance issue about looking into pagination for the list views.
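To make that follow-up concrete, here is a minimal sketch of chunked list requests: the Kubernetes API accepts a `limit` parameter and returns an opaque `continue` token while more pages remain. The base URL and response field names below are assumptions for illustration, not the dashboard's actual API layer.

```
// Sketch: page through a large collection with limit/continue instead of
// fetching everything in one request. URL shape and response fields are
// assumptions for illustration only.
const BASE = 'https://<server>/k8s/clusters/<cluster>/v1'; // placeholder

async function listAllPaged(resourceType: string, pageSize = 100): Promise<unknown[]> {
  const items: unknown[] = [];
  let continueToken: string | undefined;

  do {
    const params = new URLSearchParams({ limit: String(pageSize) });
    if (continueToken) params.set('continue', continueToken);

    const res = await fetch(`${BASE}/${resourceType}?${params}`, {
      headers: { Accept: 'application/json' },
    });
    const page = await res.json();

    // steve lists put items in `data`; the raw k8s API uses `items`.
    items.push(...(page.data ?? page.items ?? []));
    // An opaque continue token is returned while more pages remain.
    continueToken = page.continue ?? page.metadata?.continue;
  } while (continueToken);

  return items;
}
```

Each page could also be rendered as it arrives instead of waiting for the whole collection, which is where most of the perceived speed-up would come from.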
The flood of requests was corrected, but the dashboard still exhibits incorrect behavior that can be reproduced following the instructions above. This is what it looks like:
Even though this now isn't directly impacting the user, I maintain that step 3 is not the right behavior, and that the dashboard should use a valid number from the start instead of having to re-correct after feedback from the backend.
Manually reopening as the GitHub bot has failed to do so.
2.6.x backport: #7779
Actual backport ticket: #7707 (#7779 closed as duplicate)
* Moves sockets into the advanced worker
* worker can die peacefully now, making switching between clusters work.
* Make waitFor generic, wire in to waitForTestFn
* General Changes
  - Fixes for switching cluster - includes using common getPerformanceSetting
  - avoid new code to unsub before socket disconnect
  - handle `watch` `stop` requests
  - lots of TODO's (questions, work, checks, tests, etc)
  - use common
* Switch socket fixes
  - isAdvancedWorker should only be true for cluster store
  - advancedWorker to be wired in
* Fix socket id for cluster workers
  - sockets use an incremented local var for id
  - when we nuke the socket file within the worker this resets, so they all have an id of 1
  - work around this by applying the unix time
* Fix handling of new partial counts response
  - seen in dex cluster explorer dashboard
  - count cards would be removed when the partial counts response was received
* Make resourceWatcher the sole location for watch state
  - getters canWatch, watchStarted are now worked around (they look at state in the UI thread)
  - we now don't call resource.stop or resource.start in subscription
  - tidied up `forgetType`
  - moved clearFromQueue from steve mutations into subscription mutations (better location)
  - added and removed some TODOs
  - fixed watch (stop handler should be higher up, include force watch handling)
* pushes the csrf value into the worker and adds it to fetch request headers.
* refactors batchChanges to address ref concerns and be more performant
* Maintain schema reference whilst updating
  - This change mutates input in a function, which is bad...
  - but ensures the reference isn't broken, which is needed to maintain similar functionality as before
* Fix waitForTestFn
  - Seen when creating or viewing clusters
* On unwatch, ensure any pending watch requests are removed from the queue
  - this probably would have been a problem if the worker wasn't nuked
  - however, as the code's there, let's make it safe
  - Also added `trace` feature in the advanced worker, will probably bring it out to other places as well
* Fix navigation from cluster manager world to any cluster
  - Ensure that we handle the case where the advanced worker was created but the resource watcher wasn't
  - ... but fix the case where this was happening (aka ensure that a blank cluster context is ignored)
* Tidy some TODOs
* Add perf settings page
  - This will help test normal flow (when the advanced worker is disabled)
  - Note - setting is now in a bag. This may help us better support further settings (enable client side pagination, etc)
    ```
    advancedWorker: { enabled: false },
    ```
* FIX - Nav from cluster dashboard --> specific event --> cluster dashboard and events not re-subbed
  - Ensure we block default handling of resource.start (keep state in resource watcher)
* Tidying up some TODOs
* Adds in a cache and uses it to validate SCHEMA messages before batching.
* Forgot to actually save CSRF to the resourceWatcher when instantiated.
* an empty resource in a batchChange to signal remove
* Move addSchemaIndexFields to, and created removeSchemaIndexFields in, a new file
  - this avoids bringing class files into the worker
* Fix disconnect/reconnect
  - Remove `syncWatch` (do the watch/unwatch straight away)
  - Test/Fix re-sub on reconnect
  - Test/Fix growls on disconnect
* Tidying up some TODO's
  - including cleaning of workerQueue on resource.stop (this is SUPER defensive)
* batchChanges will now handle aliases
* Fix pods list - WIP
  - ensure podsByNamespace is updated on batchChange
  - TODO
    - the final update to the pod is ignored
    - removing a namespace cleans the cache correctly
    - disabling advanced worker still works
* Fix pods list - fixes
  - ensure podsByNamespace is updated on batchChange
  - Tested / Fixed
    - the final update to the pod is ignored
    - removing a namespace cleans the cache correctly
    - disabling advanced worker still works
* Tidying TODOs
* Remove default same-origin header - https://developer.mozilla.org/en-US/docs/Web/API/Request/credentials
* Fixed TODO description
* Refactor subscribe, make it clear which vuex feature relates to what
* Lots of Fixes
  - batchChanges fixes
    - fix index is 0 issues (!/!!index)
    - only `set` if we have to
    - ensure we set the correct index after pushing to list
    - ensure map is updated after reducing list size with limit
  - podsByNamespace fixes
    - ensure when we replace... we don't use the same referenced object
  - general service resource fixes
    - ensure service's pods list stays up to date with store
* Multiple improvements/fixes
  - resourceCache
    - store the hash instead of the whole object. This means longer load time but reduces memory footprint
  - resourceWatcher
    - don't re-sub on socket reconnect if watcher is in error
    - don't sub if watcher is in error
    - don't unwatch for 'failed to find schema' and 'too old' errors - this clears the error, we want to keep it to ensure we don't watch
  - Remove #5997 comments, follow on work #7917
* toggle debug, remap alias types, cleaned up comments and console
* Unit tests for batchChanges
  - Much more scope for some crazy content
* Logging tweaks
  - disable logging by default
  - initWorker comes in too late to affect initial trace, so just rely on the `debug` to toggle at runtime

Co-authored-by: Richard Cox <richard.cox@suse.com>
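For anyone skimming those notes, here is a rough sketch of the batching idea they describe (accumulate websocket changes in the worker, then flush them to the UI thread in one message). The names, shapes, and flush interval are illustrative, not the dashboard's actual implementation.

```
// Rough sketch of batching resource changes inside a web worker before
// handing them to the UI thread in one message. Names are illustrative only.
type ChangeKind = 'create' | 'change' | 'remove';

// Keyed by resource type, then by resource id; an empty object signals removal.
type BatchChanges = Record<string, Record<string, object>>;

const batch: BatchChanges = {};
let flushTimer: ReturnType<typeof setTimeout> | null = null;

function queueChange(kind: ChangeKind, type: string, id: string, resource: object): void {
  batch[type] = batch[type] || {};
  // Mirror the "empty resource signals remove" convention from the notes above.
  batch[type][id] = kind === 'remove' ? {} : resource;

  // Debounce: flush at most every 500 ms instead of once per websocket message.
  if (!flushTimer) {
    flushTimer = setTimeout(flush, 500);
  }
}

function flush(): void {
  flushTimer = null;
  // Inside a worker, postMessage hands the accumulated batch to the UI thread,
  // which can then apply it to the store in a single mutation.
  (self as any).postMessage({ batchChanges: batch });
  for (const type of Object.keys(batch)) delete batch[type];
}
```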
I have retested this and, from my understanding of the issue and the merged PRs, it should be closed. I am not seeing repetitions of the same websocket IDs, and it seems that the stop -> start watch behavior now uses sensible version numbers.
I'm running Rancher 2.6.11 and I'm not convinced that the backported fix in #7719 has actually resolved the problem. We are still seeing surges of cattle-cluster-agent watch start/stop events. These surges are correlated with high CPU and memory utilization on kube-apiserver, as well as high latency in the Rancher UI and kubectl commands.
@cmurphy should I open a new issue about this, or can/should we reopen this one? Or should we abandon 2.6.x and try to upgrade to 2.7.2+?
@skaven81 apologies for the delayed response. It would be a good idea to upgrade to 2.7.3 but I don't know if it would help with this issue. The main fix to the watch subscription spam issue was actually in rancher/steve#49 which went into 2.6.7, so this should have already been addressed. In your graphs, it looks like there are predictable surges which resolve themselves in a regular pattern. In this original issue, the re-subscription attempts were increasing at a constant rate after the initial timeout, so I would expect the graphs to continue on a straight path until the pod crashes and the line would fall off a cliff to zero. I think the surges you are seeing are probably normal and the performance issues you are seeing on those intervals could be alleviated by adding more memory and CPU. If you disagree, I would recommend opening a new issue because the symptoms you've described don't fully match the original issue.
That's not "predictable surges" shown in the graph. The entire width of that zigzag line is a single event, not a series of short ones. The behavior we see is that a Rancher UI browser session will hit some sort of corner-case condition (probably something similar to what was fixed in rancher/steve#49 with watch subscriptions timing out), and at that point we see a huge spike of kube-apiserver calls from the cattle-cluster-agent using the agent/v0.0.0 userAgent string. The kube-apiserver calls have the "acting as" metadata associated, so we can see that the calls are being executed on behalf of the user whose browser has gone haywire. Interestingly, there is no additional traffic coming from the web browser -- the traffic is entirely (or mostly) between cattle-cluster-agent and kube-apiserver in the downstream cluster. Not between the browser and Rancher, and not between Rancher and the cattle-cluster-agent. The zigzags you see in the charts are a zoomed-in view of one of these events, showing how the cattle-cluster-agent begins logging a series of watch stop/start events while simultaneously hammering the kube-apiserver in the same pattern.
Yeah, that's definitely not what we see. It has always been this sudden increase of kube-apiserver traffic for a single user (associated with one browser session) which then holds steady until the user closes their browser or refreshes the tab, at which point I guess the watches get re-established and kube-apiserver traffic for that user drops back to a normal level. I'll open a new case for this.
Thanks!
@cmurphy I have created rancher/rancher#41663 to track the other related issue.
Setup
Describe the bug
The Vue dashboard uses websockets to subscribe to kubernetes events in the downstream cluster. The watch has a default timeout of 30 minutes. If the dashboard is left open for longer than that, the watch will expire and the dashboard will attempt to resubscribe. However, the watch request requires a resource version to start the watch from, and the dashboard is providing an invalid version, causing the subscription to fail. The page must be refreshed in order to properly re-start the watch. In extreme cases at large scale, this causes the kubernetes API server to become overloaded.
To Reproduce
Create a downstream cluster. Turn on debug logging for the agent by setting `CATTLE_DEBUG` to `true` in the environment variable configuration for the cluster. Turn on verbose logging for the kubernetes API server by editing the YAML and adding a section under rancher_kubernetes_engine_config->services->kube-api (a sketch of one possible form is shown after these steps). When the cluster is running, navigate to "Deployments" under "Workloads". Create at least one workload, like a Deployment or StatefulSet, or deploy an App, and come back to the Workloads tab. Open the Network tab.
EDIT: reproducing this depends on having the List resource version be extremely skewed from the resource version of the item that is picked as a base. To reproduce this consistently, create AND delete several (i.e. tens to hundreds of) Deployments, THEN navigate to the Workloads tab and refresh. If the watch is started while Deployments are being created and deleted, the resource version used may come from one of the websocket updates and the problem will not be visible; if there is no activity happening, it will be picked from the last list response instead.
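The exact contents of the kube-api section are not shown above; assuming an RKE-provisioned cluster, a verbose-logging section could plausibly look like the following (the log level value is an example, not taken from the original report):

```
rancher_kubernetes_engine_config:
  services:
    kube-api:
      extra_args:
        v: "6"
```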
Result
In the Network tab, you can see the initial requests for the list of objects relevant to Workloads, for example:
➡️
In one of the websocket requests (for me it's always `wss://<server>/k8s/clusters/<cluster>/v1/subscribe?sockId=3`), you can see an outgoing websocket message to start a watch for that resource type and version:
🔼
{"resourceType":"apps.statefulset","resourceVersion":"400478"}
After 30 minutes, the watch is closed (which is expected, because of the default timeout):
🔽
{name: "resource.stop", resourceType: "apps.statefulset", data: {type: "subscribe", links: {}}}
We then see a new outgoing watch request, but now for a completely different resource version:
🔼
{"resourceType":"apps.statefulset","resourceVersion":"17159"}
The watch immediately fails and is closed:
🔽
{name: "resource.stop", resourceType: "apps.statefulset", data: {type: "subscribe", links: {}}}
and the cycle continues.
In the cluster agent debug logs, you can see the watch being stopped and started on repeat:
On the kubernetes API server, you will see it being hammered by repeated GET requests with the URL including `watch=true`.
It turns out this resourceVersion is from one individual item in the list of resources; see the JSON above. This is not the revision number returned for the collection by steve. Steve fails to initialize the watch because the resource version is "too old".
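For context, a watch started from a resourceVersion that the API server has already compacted is normally rejected with a 410 ERROR event roughly like the one below (the version numbers are illustrative, reusing the ones from this report):

```
{
  "type": "ERROR",
  "object": {
    "kind": "Status",
    "apiVersion": "v1",
    "status": "Failure",
    "message": "too old resource version: 17159 (400478)",
    "reason": "Expired",
    "code": 410
  }
}
```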
Note that kubernetes uses the field `resourceVersion` both for the version of the List and for the versions of the individual resources, where steve changes the List's `resourceVersion` to `revision`. When re-subscribing after a timeout, the revision number for the whole collection needs to be used, not the resource version number of any one item in the collection.
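In dashboard terms, a minimal sketch of that rule could look like the following. It assumes the list response exposes the collection revision and that change events may carry one too; the function and field names are illustrative, not the actual store code.

```
// Illustrative sketch only (not the dashboard's actual store code):
// track the collection-level revision and use it when re-subscribing.
const latestRevision: Record<string, string> = {};

// When the initial list response arrives, remember the collection revision
// (steve exposes it as `revision` on the list; raw k8s puts it in the
// List object's metadata.resourceVersion).
function onListLoaded(resourceType: string, list: { revision: string }): void {
  latestRevision[resourceType] = list.revision;
}

// If change events carry the revision they were observed at, keep the stored
// value current so a later re-subscribe starts from "now".
function onResourceEvent(resourceType: string, event: { revision?: string }): void {
  if (event.revision) latestRevision[resourceType] = event.revision;
}

// On a `resource.stop` caused by the watch timeout, re-subscribe from the
// collection revision rather than from any single item's resourceVersion.
function resubscribe(socket: WebSocket, resourceType: string): void {
  socket.send(JSON.stringify({
    resourceType,
    resourceVersion: latestRevision[resourceType],
  }));
}
```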
Expected Result
After a timeout, a new watch is started successfully and is ongoing until the next timeout.
Screenshots
Additional context
This may not be 100% reproducible because sometimes the item resource version is close enough to the collection revision that the watch still works, but this is an accident. If you deploy additional workloads and apps without making updates to any of them, eventually the collection revision number will get significantly bigger while the original item version number stays the same, and you should see the issue occur.
To help with debugging, I've proposed a change to steve which will:

* send a `resource.error` event through the websocket if the connection is closed irregularly
* allow the watch timeout to be configured, so that `CATTLE_WATCH_TIMEOUT_SECONDS` can be set to, e.g., 300 (5 minutes) in the cluster agent environment.

The agent image with this change is published as `cmurpheus/rancher-agent:v2.6-head-0e2fdc-steve-debug`. EDIT: new image with latest changes at `cmurpheus/rancher-agent:v2.6-head-f4e1de7-steve-debug`.
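If you want to try the shorter timeout, one way to set that variable on the downstream cluster agent (assuming the usual cattle-cluster-agent deployment in the cattle-system namespace) is:

```
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_WATCH_TIMEOUT_SECONDS=300
```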