
Upgrade Kubernetes version on Bare Metal Bottlerocket EKS-A cluster fails: kube-vip can't bind IP #6535

Closed
max-boehm opened this issue Aug 18, 2023 · 4 comments
Labels
area/providers/tinkerbell Tinkerbell provider related tasks and issues area/upgrades external An issue, bug or feature request filed from outside the AWS org
Comments

@max-boehm

What happened: Upgrading a newly created Bare Metal EKS-A standalone cluster with Bottlerocket to a new Kubernetes version (e.g. 1.26 -> 1.27) fails. With other node operating systems, e.g. Ubuntu, the upgrade works. The kube-vip pod in the kube-system namespace on the new control plane node reports "listen:listen tcp :2112: bind: address already in use\n". The cluster endpoint IP can then no longer be reached from any node ("connect: connection refused"), etcd becomes unreachable, the cluster is inaccessible, and the upgrade fails.

What you expected to happen: Upgrade to succeed.

How to reproduce it (as minimally and precisely as possible): A minimal cluster config with 1 control plane node and 1 worker node, kubernetesVersion 1.26, the Tinkerbell provider, and Bottlerocket is sufficient. Invoke eksctl anywhere create cluster .... After this has succeeded, update kubernetesVersion to 1.27 and the osImageURL in the config file, then invoke eksctl anywhere upgrade cluster .... This results in an inaccessible cluster and a failed upgrade.
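The steps above can be sketched as shell commands. This is only a sketch: the CLUSTER_CONFIG variable is hypothetical, and the guard makes it a no-op unless you point it at a real cluster config file.

```shell
# Repro sketch of the steps above. CLUSTER_CONFIG is a hypothetical
# variable; leave it unset to make this a safe no-op.
CLUSTER_CONFIG=${CLUSTER_CONFIG:-}
if [ -n "$CLUSTER_CONFIG" ]; then
  # 1. Create the cluster: kubernetesVersion 1.26, Tinkerbell provider,
  #    Bottlerocket, 1 control plane node, 1 worker node.
  eksctl anywhere create cluster -f "$CLUSTER_CONFIG"
  # 2. Edit the config: kubernetesVersion 1.26 -> 1.27, new osImageURL.
  # 3. Upgrade; with Bottlerocket this leaves the cluster inaccessible.
  eksctl anywhere upgrade cluster -f "$CLUSTER_CONFIG"
else
  echo "CLUSTER_CONFIG not set; skipping (sketch only)"
fi
```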

Anything else we need to know?: In my test, the new Bottlerocket control plane node showed the following kube-vip container logs. The second kube-vip container did not start correctly and did not advertise the cluster endpoint IP, which in my case was 10.20.73.115:

bash-5.1# cd /var/log/containers/
bash-5.1# head -20 kube-vip*
==> kube-vip-5bbc4_eksa-system_kube-vip-a34674a29b068ed38bcc6b2e3dadef41f6f383197e2c0398ce63290d2b6d4a67.log <==
2023-08-18T11:22:59.127646987Z stderr F time="2023-08-18T11:22:59Z" level=info msg="Starting kube-vip.io []"
2023-08-18T11:22:59.127674975Z stderr F time="2023-08-18T11:22:59Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[false], Services:[true]"
2023-08-18T11:22:59.127682196Z stderr F time="2023-08-18T11:22:59Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
2023-08-18T11:22:59.12829605Z stderr F time="2023-08-18T11:22:59Z" level=info msg="kube-vip will bind to interface [ens160]"
2023-08-18T11:22:59.128308091Z stderr F time="2023-08-18T11:22:59Z" level=info msg="prometheus HTTP server started"
2023-08-18T11:22:59.128910598Z stderr F time="2023-08-18T11:22:59Z" level=info msg="Starting Kube-vip Manager with the ARP engine"
2023-08-18T11:22:59.128924275Z stderr F time="2023-08-18T11:22:59Z" level=info msg="beginning watching services, leaderelection will happen for every service"
2023-08-18T11:22:59.128927705Z stderr F time="2023-08-18T11:22:59Z" level=info msg="starting services watcher for all namespaces"
2023-08-18T11:22:59.140829982Z stderr F time="2023-08-18T11:22:59Z" level=info msg="[endpoint] watching for service [envoy] in namespace [eksa-system]"
2023-08-18T11:25:30.62989823Z stderr F E0818 11:25:30.616435       1 retrywatcher.go:130] "Watch failed" err="Get \"https://10.96.0.1:443/api/v1/services?watch=true\": dial tcp 10.96.0.1:443: connect: connection refused"
2023-08-18T11:25:30.629933245Z stderr F E0818 11:25:30.616591       1 retrywatcher.go:130] "Watch failed" err="Get \"https://10.96.0.1:443/api/v1/namespaces/eksa-system/endpoints?fieldSelector=metadata.name%3Denvoy&watch=true\": dial tcp 10.96.0.1:443: connect: connection refused"

==> kube-vip-bottle-upgrade-9fh92_kube-system_kube-vip-d1a5ea0bb61f7b8a7928745c733b2940622bb8e8592791191aa4fc91aa8905f8.log <==
2023-08-18T11:39:15.275140488Z stderr F time="2023-08-18T11:39:15Z" level=info msg="Starting kube-vip.io []"
2023-08-18T11:39:15.275164946Z stderr F time="2023-08-18T11:39:15Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[true], Services:[false]"
2023-08-18T11:39:15.275174573Z stderr F time="2023-08-18T11:39:15Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
2023-08-18T11:39:15.275360542Z stderr F time="2023-08-18T11:39:15Z" level=info msg="prometheus HTTP server started"
2023-08-18T11:39:15.277148184Z stderr F time="2023-08-18T11:39:15Z" level=fatal msg="listen:listen tcp :2112: bind: address already in use\n"
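To confirm the conflict on the node, one can check which process is listening on port 2112. This is a generic sketch; the availability of ss or netstat in the Bottlerocket admin container is an assumption.

```shell
# Diagnostic sketch: list the listener on the kube-vip metrics port.
PORT=2112
if command -v ss >/dev/null 2>&1; then
  ss -ltnp "( sport = :$PORT )" || true
elif command -v netstat >/dev/null 2>&1; then
  netstat -ltnp 2>/dev/null | grep ":$PORT " || true
fi
echo "checked port $PORT"
```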

Environment:

  • EKS Anywhere Release: 0.16.2
  • EKS Distro Release:
@max-boehm
Author

Interestingly, the problem does not occur in tests where the physical setup (Bare Metal machines, dedicated L2 network) is simulated in a virtualization environment such as KVM or VMware.

@vivek-koppuru vivek-koppuru added the external An issue, bug or feature request filed from outside the AWS org label Aug 28, 2023
@vivek-koppuru vivek-koppuru added this to the oncall milestone Aug 28, 2023
@vivek-koppuru vivek-koppuru added the area/providers/tinkerbell Tinkerbell provider related tasks and issues label Aug 28, 2023
@max-boehm
Author

I ran into this problem again when upgrading the cluster from EKS Anywhere 0.16.2 to 0.17.3. Here are some more findings:

The static kube-vip-<clustername> pod that advertises the IP address of the cluster endpoint fails to start on the new nodes. The reason is that another kube-vip pod (from the Tinkerbell helm chart deployment) is still running in the cluster and has already bound port 2112. The kube-vip-<clustername> pod also tries to bind that port and fails with the message "tcp port 2112 is already in use".
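The failure mode is the generic TCP rule that only one socket can bind a given address and port. A minimal stand-in for the conflict (python3 invoked from the shell is an assumption; an OS-chosen port is used instead of the real 2112 so the demo is side-effect free):

```shell
# Two binds to the same port fail with EADDRINUSE, just like the
# second kube-vip instance trying to expose metrics on :2112.
RESULT=$(python3 - <<'EOF'
import socket, errno
a = socket.socket()
a.bind(("127.0.0.1", 0))         # first "kube-vip": grabs a free port
a.listen(1)
port = a.getsockname()[1]
b = socket.socket()
try:
    b.bind(("127.0.0.1", port))  # second "kube-vip": same port
except OSError as e:
    print("bind failed:", errno.errorcode[e.errno])
EOF
)
echo "$RESULT"
```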

@max-boehm
Author

Each time an upgrade is started, this problem occurs again. I have not found the root cause, but a workaround is to temporarily move the file /etc/kubernetes/manifests/kube-vip out of its folder and restart containerd, then move it back and restart containerd again. This terminates the process holding port 2112; afterwards a new kube-vip process starts correctly.
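The workaround above, sketched as shell on the control plane node. The manifest path is taken from the comment; the /tmp staging location and the sleep duration are assumptions, and the guard makes this a no-op on a machine without the manifest.

```shell
# Workaround sketch: bounce the static kube-vip pod so the stale
# process holding port 2112 is terminated.
MANIFEST=/etc/kubernetes/manifests/kube-vip
if [ -f "$MANIFEST" ]; then
  mv "$MANIFEST" /tmp/kube-vip   # containerd tears the static pod down
  systemctl restart containerd
  sleep 30                       # assumption: give the pod time to stop
  mv /tmp/kube-vip "$MANIFEST"   # pod is recreated and binds :2112 cleanly
  systemctl restart containerd
else
  echo "no kube-vip manifest found; nothing to do"
fi
```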

@jacobweinstock
Member

Hey @max-boehm, this is fixed in EKS Anywhere > v0.20.

Development

No branches or pull requests

4 participants