
Agent losing connection when there is a leader change and a disk pressure #6304

Closed
zifeo opened this issue Jul 10, 2024 · 26 comments

@zifeo

zifeo commented Jul 10, 2024

Environmental Info:
RKE2 Version: v1.30.1+rke2r1

Node(s) CPU architecture, OS, and Version: x86, Ubuntu Jammy

Cluster Configuration: 3 servers, 3 agents

Describe the bug:

This is likely similar to #5949. While troubleshooting further, it seems that an agent under disk pressure might not re-balance or re-connect to the new leading api server correctly.

Steps To Reproduce:

Set up a cluster with 3 servers and 3 agents, check which IP is listed first in rke2-agent-load-balancer.json, create some disk pressure on the agent (e.g. too many images scheduled on that node), then remove the leading server and ensure the replacement node has a different IP. The agent will then lose its connection.
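For reference, the agent's load-balancer state file (typically /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json) looks roughly like the sketch below; the exact field names can vary by version, and the addresses shown are just the ones from this cluster:

```json
{
  "ServerURL": "https://192.168.42.4:9345",
  "ServerAddresses": [
    "192.168.42.239:9345",
    "192.168.42.150:9345",
    "192.168.42.160:9345"
  ]
}
```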

Expected behavior:

The agent correctly load-balances and reconnects to the new leader once it recovers from disk pressure.

Actual behavior:

The agent must be restarted manually to restore the connection.

Additional context / logs:

log.zip

@brandond
Member

might not re-balance or re-connect to the new leading api server correctly

The apiserver does not have a leader, only controllers have leaders. etcd also has a leader, but this is transparent to etcd clients. Which specific controller or lease are you seeing associated with this behavior?

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond Bad phrasing on my part: the agent loses its connection to the (leading? not sure about that point) proxy and does not reconnect to another one (also note the typo "reconecting" in the log message). It keeps retrying...

Jul 10 16:06:01 cloud-b-1 rke2[1081]: time="2024-07-10T16:06:01Z" level=info msg="Connecting to proxy" url="wss://192.168.42.239:9345/v1-rke2/connect"
Jul 10 16:06:01 cloud-b-1 rke2[1081]: time="2024-07-10T16:06:01Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.42.239:9345: connect: connection refused"
Jul 10 16:06:01 cloud-b-1 rke2[1081]: time="2024-07-10T16:06:01Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 192.168.42.239:9345: connect: connection refused" url="wss://192.168.42.239:9345/v1-rke2/connect"

@brandond
Member

Are the logs from the 192.168.42.239 node included in the zip?

If it is refusing connections, then the rke2-server service on that node is not running for some reason. Is it crashing?

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond Sadly, we did not have debug logging enabled on .239. I suspect the API server ran out of memory and rebooted, causing a server change (from kube-vip), but only temporarily, as all the servers were healthy by the time we intervened.

@brandond
Member

The logs from this agent only go back a few minutes prior to when it disconnected from the server. Do you have logs going back to the previous start of the service?

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond The full logs are here: agent-lost.log.zip.

@brandond
Member

brandond commented Jul 10, 2024

What was the actual sequence of events here, on the other servers? I see the agent getting disconnected from that server here:

Jul 10 16:04:58 cloud-b-1 rke2[1081]: time="2024-07-10T16:04:58Z" level=error msg="Remotedialer proxy error; reconecting..." error="websocket: close 1006 (abnormal closure): unexpected EOF" url="wss://192.168.42.239:9345/v1-rke2/connect"

However, that does not trigger any failover of the apiserver load-balancer, as there were no active connections to that node when it failed. The load-balancer had failed over to a different server almost 8 days earlier:

Jul 02 20:28:52 cloud-b-1 rke2[1081]: time="2024-07-02T20:28:52Z" level=debug msg="Dial error from load balancer rke2-api-server-agent-load-balancer: dial tcp 192.168.42.239:6443: connect: connection refused"
Jul 02 20:28:52 cloud-b-1 rke2[1081]: time="2024-07-02T20:28:52Z" level=debug msg="Failed over to new server for load balancer rke2-api-server-agent-load-balancer: 192.168.42.160:6443"

There was a bunch of thrashing for a few minutes before that, where 192.168.42.160 and 192.168.42.150 kept taking turns leaving and joining the load-balancer server address list. Then everything seemed to settle down until the 10th, when the remotedialer websocket connection to 192.168.42.239 was lost, followed by a restart of the agent. It appears that following the restart of the agent, 192.168.42.239 is no longer online at all? It's no longer in the load-balancer server address lists.

I can't really make heads or tails of it, without knowing what was going on with these servers at the time.

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond Thanks for the insight. The restart was a manual intervention on our part. The thrashing was likely caused by the API server getting OOM-killed. Let me try to gather logs for the whole cluster on the next occurrence.

@brandond
Member

What made you decide to restart it at that point?

@brandond
Member

brandond commented Jul 10, 2024

If your server nodes are under memory pressure, you might consider adding some reservations for your critical pods, via the control-plane-resource-requests and control-plane-resource-limits config options: https://docs.rke2.io/advanced#control-plane-component-resource-requestslimits

We don't set these by default, as the required resources are highly environment specific. You should baseline your current utilization, and then set the appropriate requests and limits.
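For example, the servers' /etc/rancher/rke2/config.yaml could include something like the following; the values here are purely illustrative, so plug in numbers from your own baseline:

```yaml
# /etc/rancher/rke2/config.yaml on each server -- illustrative values only
control-plane-resource-requests:
  - kube-apiserver-cpu=500m
  - kube-apiserver-memory=1G
  - kube-controller-manager-cpu=200m
  - kube-controller-manager-memory=256M
  - etcd-cpu=500m
  - etcd-memory=512M
control-plane-resource-limits:
  - kube-apiserver-memory=4G
```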

@brandond
Member

Also, if possible, please update to v1.30.2 when you get a chance; you may be running into k3s-io/k3s#10279.

@brandond
Member

brandond commented Jul 11, 2024

@zifeo Based on the logs, it looks like you're using a load-balancer (192.168.42.4) as the fixed registration address (--server address) for your nodes. Is that correct? How are you hosting this endpoint? Are you by any chance using kube-vip or MetalLB to expose a Kubernetes service at this address?

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond

  • I restarted the agent because I recognized the endless connection-retry loop.
  • Yes, we have those set up. I assume some CRDs were pushing the API server above its limits...
  • Will do.
  • Correct, 192.168.42.4 is the VIP managed by kube-vip (see the config sketch below).
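For completeness, each agent points at that VIP as its fixed registration address, roughly like this:

```yaml
# /etc/rancher/rke2/config.yaml on each agent (token redacted)
server: https://192.168.42.4:9345
token: <redacted>
```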

@brandond
Member

brandond commented Jul 11, 2024

OK. So far kube-vip appears to be the common denominator between this issue and #6208, so I think we're running into the same thing. As discussed at #6208 (comment), it seems that kube-proxy's iptables rules may be interfering with connections to the VIP and preventing failover to a new endpoint.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond One more item: we are using Cilium in strict kube-proxy replacement mode.

@brandond
Member

I believe Cilium will do the same thing as kube-proxy with regard to locally redirecting LoadBalancer service IP traffic. I added some comments to the other issue regarding a beta Kubernetes feature that can be used to disable it when using kube-proxy. I don't know if there is any similar way to disable that when using Cilium's kube-proxy replacement.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond Is there a name for this in kube-proxy so I can investigate on the Cilium side? Note also that this never happened before 1.30.

@brandond
Member

brandond commented Jul 11, 2024

LoadBalancerIPMode is the feature gate; here is the KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1860-kube-proxy-IP-node-binding
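For completeness (this applies to stock kube-proxy, not Cilium's replacement), if the gate isn't already enabled for your Kubernetes version it could be turned on through RKE2's component arg passthrough. This is only a sketch, not something we ship by default:

```yaml
# /etc/rancher/rke2/config.yaml on the servers -- sketch only; LoadBalancerIPMode
# is beta and enabled by default in v1.30, so this is likely unnecessary there
kube-apiserver-arg:
  - feature-gates=LoadBalancerIPMode=true
kube-proxy-arg:
  - feature-gates=LoadBalancerIPMode=true
```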

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond Thanks for the insights. After looking into it, I am not quite sure I follow. As far as I understand, kube-vip manages its VIP outside of Kubernetes via ARP broadcasts, pointing an IP outside the cluster ranges at the leading server, and that leader then forwards requests using IPVS. Can you explain a bit more how that is linked to KEP-1860?

@brandond
Member

brandond commented Jul 11, 2024

kube-vip and other LoadBalancer controllers put the VIP address in the .status.loadBalancer.ingress.ip field on the Service. Kube-proxy and kube-proxy replacements like cilium will add rules that directly route traffic for that VIP address to the backend endpoints for that Service, bypassing the load-balancer entirely, and directly hitting the backend nodes that kube-proxy believes should host that service. This is fine, until kube-proxy or cilium lose their connection to the apiserver, and the locally injected endpoints no longer match the endpoints that the VIP would actually send traffic to.

That KEP allows for disabling that kube-proxy behavior with a new field in the .status.loadBalancer.ingress structure.
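Roughly, the new field added by that KEP looks like this on the Service; it is populated by the LoadBalancer controller, not set by hand:

```yaml
# Hypothetical Service status illustrating KEP-1860
status:
  loadBalancer:
    ingress:
      - ip: 192.168.42.4
        # "VIP" (the default) lets kube-proxy short-circuit traffic straight to
        # the backend endpoints; "Proxy" tells it to leave the address alone so
        # traffic really goes through the external load-balancer
        ipMode: Proxy
```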

@brandond
Member

tl;dr: the ARP broadcast for the VIP doesn't matter to cluster members, because kube-proxy or Cilium's kube-proxy replacement bypasses the VIP entirely. The VIP is only used outside the cluster, or before kube-proxy or Cilium is running.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond Oh, I see. This should only happen when you use kube-vip to manage load-balancing for Services. That is actually disabled on our end; we only use the "service-less" control-plane load-balancing, so it stays outside of the cluster.

@brandond
Member

brandond commented Jul 11, 2024

Can you provide more details on your kube-vip deployment, including the yaml spec of the service that is hosting that VIP? See the info provided at #6208 (comment) as an example.

So far, kube-vip is the primary common factor between these two environments, and it is something we do not generally use when testing RKE2.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond There is no Service, only the following static pod on each server, which is moved at startup time to /var/lib/rancher/rke2/agent/pod-manifests/kube-vip.yaml: kube-vip.yaml.
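Abbreviated, the manifest looks roughly like this; the attached kube-vip.yaml is authoritative, and the env var names here are taken from the kube-vip docs so they may not match our file exactly:

```yaml
# Abbreviated sketch of the kube-vip static pod -- control-plane VIP only,
# Service load-balancing disabled; see the attached kube-vip.yaml for the
# exact manifest
apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-vip
      image: ghcr.io/kube-vip/kube-vip:<version>
      args: ["manager"]
      env:
        - name: address
          value: "192.168.42.4"   # the VIP
        - name: vip_arp
          value: "true"           # announce the VIP via ARP
        - name: vip_interface
          value: "<interface>"
        - name: cp_enable
          value: "true"           # control-plane load-balancing
        - name: svc_enable
          value: "false"          # Service load-balancing disabled
        - name: vip_leaderelection
          value: "true"
      securityContext:
        capabilities:
          add: ["NET_ADMIN", "NET_RAW"]
```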

@brandond
Member

OK. That's interesting. I'll have to try with that as well. So far kube-vip is the only thing common to both environments, regardless of configuration.

@brandond
Member

OK, so with that kube-vip manifest I was able to find the issue. It is not kube-vip's fault, but for some reason I was able to reproduce the issue while using kube-vip, when I previously had not been able to do so. It is the same thing as #6208, so I am going to close this out and follow up there.
