
Agent losing connection when there is a leader change and a disk pressure #6304

Closed
zifeo opened this issue Jul 10, 2024 · 26 comments

@zifeo

zifeo commented Jul 10, 2024

Environmental Info:
RKE2 Version: v1.30.1+rke2r1

Node(s) CPU architecture, OS, and Version: x86, Ubuntu Jammy

Cluster Configuration: 3 servers, 3 agents

Describe the bug:

This is likely similar to #5949. While troubleshooting further, it seems that an agent under disk pressure might not re-balance or re-connect to the new leading api server correctly.

Steps To Reproduce:

Set up a cluster with 3 servers and 3 agents, check which IP is listed first in rke2-agent-load-balancer.json, create some disk pressure on the agent (e.g. too many images scheduled on that node), then remove the leading server and ensure the replacement node has a different IP. The agent will then lose its connection.
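For reference, the agent's load-balancer state file (typically /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json) looks roughly like the sketch below; the exact field names can vary by version, and the addresses shown are just the ones from this cluster:

```json
{
  "ServerURL": "https://192.168.42.4:9345",
  "ServerAddresses": [
    "192.168.42.239:9345",
    "192.168.42.150:9345",
    "192.168.42.160:9345"
  ]
}
```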

Expected behavior:

The agent correctly load-balances and reconnects to the new leader once it recovers from disk pressure.

Actual behavior:

The agent must be restarted manually to restore the connection.

Additional context / logs:

log.zip

@brandond
Member

might not re-balance or re-connect to the new leading api server correctly

The apiserver does not have a leader, only controllers have leaders. etcd also has a leader, but this is transparent to etcd clients. Which specific controller or lease are you seeing associated with this behavior?

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond Bad phrasing on my part: the agent loses its connection to the (leading? not sure about that point) proxy and does not reconnect to another one (also note the typo "reconecting" in the log message). It keeps retrying...

Jul 10 16:06:01 cloud-b-1 rke2[1081]: time="2024-07-10T16:06:01Z" level=info msg="Connecting to proxy" url="wss://192.168.42.239:9345/v1-rke2/connect"
Jul 10 16:06:01 cloud-b-1 rke2[1081]: time="2024-07-10T16:06:01Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.42.239:9345: connect: connection refused"
Jul 10 16:06:01 cloud-b-1 rke2[1081]: time="2024-07-10T16:06:01Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 192.168.42.239:9345: connect: connection refused" url="wss://192.168.42.239:9345/v1-rke2/connect"

@brandond
Member

Are the logs from the 192.168.42.239 node included in the zip?

If it is refusing connections, then the rke2-server service on that node is not running for some reason. Is it crashing?

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond Sadly, we did not have debug logging enabled on .239. I suspect the API server ran out of memory and rebooted, causing a server change (from kube-vip), but only temporarily, as all the servers were healthy by the time we intervened.

@brandond
Member

The logs from this agent only go back a few minutes prior to when it disconnected from the server. Do you have logs going back to the previous start of the service?

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond The full logs are here: agent-lost.log.zip.

@brandond
Member

brandond commented Jul 10, 2024

What was the actual sequence of events here, on the other servers? I see the agent getting disconnected from that server here:

Jul 10 16:04:58 cloud-b-1 rke2[1081]: time="2024-07-10T16:04:58Z" level=error msg="Remotedialer proxy error; reconecting..." error="websocket: close 1006 (abnormal closure): unexpected EOF" url="wss://192.168.42.239:9345/v1-rke2/connect"

However, that does not trigger any failover of the apiserver load-balancer, as there were no active connections to that node when it failed. The load-balancer had failed over to a different server almost 8 days earlier:

Jul 02 20:28:52 cloud-b-1 rke2[1081]: time="2024-07-02T20:28:52Z" level=debug msg="Dial error from load balancer rke2-api-server-agent-load-balancer: dial tcp 192.168.42.239:6443: connect: connection refused"
Jul 02 20:28:52 cloud-b-1 rke2[1081]: time="2024-07-02T20:28:52Z" level=debug msg="Failed over to new server for load balancer rke2-api-server-agent-load-balancer: 192.168.42.160:6443"

There was a bunch of thrashing for a few minutes before that, where 192.168.42.160 and 192.168.42.150 kept taking turns leaving and joining the load-balancer server address list. Then everything seemed to settle down until the 10th, when the remotedialer websocket connection to 192.168.42.239 was lost, followed by a restart of the agent. It appears that following the restart of the agent, 192.168.42.239 is no longer online at all? It's no longer in the load-balancer server address lists.

I can't really make heads or tails of it, without knowing what was going on with these servers at the time.

@zifeo
Author

zifeo commented Jul 10, 2024

@brandond Thanks for the insight. The restart was a manual intervention on our part. The thrashing was likely caused by the API server getting OOM-killed. Let me try to gather logs for the whole cluster on the next occurrence.

@brandond
Member

What made you decide to restart it at that point?

@brandond
Member

brandond commented Jul 10, 2024

If your server nodes are under memory pressure, you might consider adding some reservations for your critical pods, via the control-plane-resource-requests and control-plane-resource-limits config options: https://docs.rke2.io/advanced#control-plane-component-resource-requestslimits

We don't set these by default, as the required resources are highly environment specific. You should baseline your current utilization, and then set the appropriate requests and limits.
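For example, the servers' /etc/rancher/rke2/config.yaml could include something like the following; the values here are purely illustrative, so plug in numbers from your own baseline:

```yaml
# /etc/rancher/rke2/config.yaml on each server -- illustrative values only
control-plane-resource-requests:
  - kube-apiserver-cpu=500m
  - kube-apiserver-memory=1G
  - kube-controller-manager-cpu=200m
  - kube-controller-manager-memory=256M
  - etcd-cpu=500m
  - etcd-memory=512M
control-plane-resource-limits:
  - kube-apiserver-memory=4G
```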

@brandond
Member

Also, if possible, please update to v1.30.2 when you get a chance; you may be running into k3s-io/k3s#10279.

@brandond
Member

brandond commented Jul 11, 2024

@zifeo Based on the logs, it looks like you're using a load-balancer (192.168.42.4) as the fixed registration address (--server address) for your nodes. Is that correct? How are you hosting this endpoint? Are you by any chance using kube-vip or MetalLB to expose a Kubernetes service at this address?

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond

  • I restarted the agent because I recognized the endless connection-retry loop.
  • Yes, we have those set up. I assume some CRDs were pushing the API server above its limits...
  • Will do.
  • Correct, 192.168.42.4 is the VIP managed by kube-vip (see the config sketch below).
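For completeness, each agent points at that VIP as its fixed registration address, roughly like this:

```yaml
# /etc/rancher/rke2/config.yaml on each agent (token redacted)
server: https://192.168.42.4:9345
token: <redacted>
```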

@brandond
Member

brandond commented Jul 11, 2024

OK. So far kube-vip appears to be the common denominator between this issue and #6208, so I think we're running into the same thing. As discussed at #6208 (comment), it seems that kube-proxy's iptables rules may be interfering with connections to the VIP and preventing failover to a new endpoint.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond One more item: we are using Cilium in strict kube-proxy replacement mode.

@brandond
Member

I believe Cilium will do the same thing as kube-proxy with regard to locally redirecting LoadBalancer service IP traffic. I added some comments to the other issue regarding a beta Kubernetes feature that can be used to disable it when using kube-proxy. I don't know if there is any similar way to disable that when using Cilium's kube-proxy replacement.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond Is there a name for this in kube-proxy so I can investigate on the Cilium side? Note also that this never happened before 1.30.

@brandond
Member

brandond commented Jul 11, 2024

LoadBalancerIPMode is the feature gate; here is the KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1860-kube-proxy-IP-node-binding
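For completeness (this applies to stock kube-proxy, not Cilium's replacement), if the gate isn't already enabled for your Kubernetes version it could be turned on through RKE2's component arg passthrough. This is only a sketch, not something we ship by default:

```yaml
# /etc/rancher/rke2/config.yaml on the servers -- sketch only; LoadBalancerIPMode
# is beta and enabled by default in v1.30, so this is likely unnecessary there
kube-apiserver-arg:
  - feature-gates=LoadBalancerIPMode=true
kube-proxy-arg:
  - feature-gates=LoadBalancerIPMode=true
```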

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond Thanks for the insights. After looking into it, I am not quite sure I follow. As far as I understand, kube-vip manages its VIP outside of Kubernetes via ARP broadcasts, pointing an IP outside the cluster ranges at the leading server, and that leader then forwards requests using IPVS. Can you explain a bit more how that is linked to KEP-1860?

@brandond
Member

brandond commented Jul 11, 2024

kube-vip and other LoadBalancer controllers put the VIP address in the .status.loadBalancer.ingress.ip field on the Service. Kube-proxy and kube-proxy replacements like cilium will add rules that directly route traffic for that VIP address to the backend endpoints for that Service, bypassing the load-balancer entirely, and directly hitting the backend nodes that kube-proxy believes should host that service. This is fine, until kube-proxy or cilium lose their connection to the apiserver, and the locally injected endpoints no longer match the endpoints that the VIP would actually send traffic to.

That KEP allows for disabling that kube-proxy behavior with a new field in the .status.loadBalancer.ingress structure.
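Roughly, the new field added by that KEP looks like this on the Service; it is populated by the LoadBalancer controller, not set by hand:

```yaml
# Hypothetical Service status illustrating KEP-1860
status:
  loadBalancer:
    ingress:
      - ip: 192.168.42.4
        # "VIP" (the default) lets kube-proxy short-circuit traffic straight to
        # the backend endpoints; "Proxy" tells it to leave the address alone so
        # traffic really goes through the external load-balancer
        ipMode: Proxy
```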

@brandond
Member

tl;dr: the ARP broadcast for the VIP doesn't matter to cluster members, because kube-proxy or Cilium's kube-proxy replacement bypasses the VIP entirely. The VIP is only used outside the cluster, or before kube-proxy or Cilium is running.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond Oh, I see. This should only happen when you use kube-vip to manage load-balancing for Services. That is actually disabled on our end; we only use the "service-less" control-plane load-balancing, so it stays outside of the cluster.

@brandond
Member

brandond commented Jul 11, 2024

Can you provide more details on your kube-vip deployment, including the yaml spec of the service that is hosting that VIP? See the info provided at #6208 (comment) as an example.

So far, kube-vip is the primary common factor between these two environments, and it is something we do not generally use when testing RKE2.

@zifeo
Author

zifeo commented Jul 11, 2024

@brandond There is no Service, only the following static pod on each server, which is moved at startup time to /var/lib/rancher/rke2/agent/pod-manifests/kube-vip.yaml: kube-vip.yaml.
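Abbreviated, the manifest looks roughly like this; the attached kube-vip.yaml is authoritative, and the env var names here are taken from the kube-vip docs so they may not match our file exactly:

```yaml
# Abbreviated sketch of the kube-vip static pod -- control-plane VIP only,
# Service load-balancing disabled; see the attached kube-vip.yaml for the
# exact manifest
apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-vip
      image: ghcr.io/kube-vip/kube-vip:<version>
      args: ["manager"]
      env:
        - name: address
          value: "192.168.42.4"   # the VIP
        - name: vip_arp
          value: "true"           # announce the VIP via ARP
        - name: vip_interface
          value: "<interface>"
        - name: cp_enable
          value: "true"           # control-plane load-balancing
        - name: svc_enable
          value: "false"          # Service load-balancing disabled
        - name: vip_leaderelection
          value: "true"
      securityContext:
        capabilities:
          add: ["NET_ADMIN", "NET_RAW"]
```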

@brandond
Member

OK. That's interesting. I'll have to try with that as well. So far kube-vip is the only thing common to both environments, regardless of configuration.

@brandond
Member

OK, so with that kube-vip manifest I was able to find the issue. It is not kube-vip's fault, but for some reason I was able to reproduce the issue while using kube-vip, when I previously had not been able to do so. It is the same thing as #6208, so I am going to close this out and follow up there.
