
Upgrade Kubernetes version on Bare Metal Bottlerocket EKS-A cluster fails: kube-vip can't bind IP #6535

Closed
max-boehm opened this issue Aug 18, 2023 · 4 comments
Labels
area/providers/tinkerbell Tinkerbell provider related tasks and issues area/upgrades external An issue, bug or feature request filed from outside the AWS org
Comments

@max-boehm

What happened: Upgrading a newly created Bare Metal EKS-A standalone cluster with Bottlerocket to a new Kubernetes version (e.g. 1.26 -> 1.27) fails. With other node operating systems, e.g. Ubuntu, the upgrade works. The kube-vip pod in the kube-system namespace on the new control plane node reports "listen:listen tcp :2112: bind: address already in use\n". The cluster endpoint IP can then no longer be reached from any node ("connect: connection refused"), etcd becomes unreachable, the cluster is inaccessible, and the upgrade fails.

What you expected to happen: Upgrade to succeed.

How to reproduce it (as minimally and precisely as possible): A minimal cluster config with 1 control plane node and 1 worker node, kubernetesVersion 1.26, the Tinkerbell provider, and Bottlerocket is sufficient. Invoke eksctl anywhere create cluster .... After this has succeeded, update kubernetesVersion to 1.27 and the osImageURL in the config file, then invoke eksctl anywhere upgrade cluster .... This results in an inaccessible cluster and a failed upgrade.
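The steps above can be sketched as shell commands. This is only a sketch: the CLUSTER_CONFIG variable is hypothetical, and the guard makes it a no-op unless you point it at a real cluster config file.

```shell
# Repro sketch of the steps above. CLUSTER_CONFIG is a hypothetical
# variable; leave it unset to make this a safe no-op.
CLUSTER_CONFIG=${CLUSTER_CONFIG:-}
if [ -n "$CLUSTER_CONFIG" ]; then
  # 1. Create the cluster: kubernetesVersion 1.26, Tinkerbell provider,
  #    Bottlerocket, 1 control plane node, 1 worker node.
  eksctl anywhere create cluster -f "$CLUSTER_CONFIG"
  # 2. Edit the config: kubernetesVersion 1.26 -> 1.27, new osImageURL.
  # 3. Upgrade; with Bottlerocket this leaves the cluster inaccessible.
  eksctl anywhere upgrade cluster -f "$CLUSTER_CONFIG"
else
  echo "CLUSTER_CONFIG not set; skipping (sketch only)"
fi
```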

Anything else we need to know?: In my test, the new Bottlerocket control plane node showed the following kube-vip container logs. The second kube-vip container did not start correctly and did not advertise the cluster endpoint IP, which in my case was 10.20.73.115:

bash-5.1# cd /var/log/containers/
bash-5.1# head -20 kube-vip*
==> kube-vip-5bbc4_eksa-system_kube-vip-a34674a29b068ed38bcc6b2e3dadef41f6f383197e2c0398ce63290d2b6d4a67.log <==
2023-08-18T11:22:59.127646987Z stderr F time="2023-08-18T11:22:59Z" level=info msg="Starting kube-vip.io []"
2023-08-18T11:22:59.127674975Z stderr F time="2023-08-18T11:22:59Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[false], Services:[true]"
2023-08-18T11:22:59.127682196Z stderr F time="2023-08-18T11:22:59Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
2023-08-18T11:22:59.12829605Z stderr F time="2023-08-18T11:22:59Z" level=info msg="kube-vip will bind to interface [ens160]"
2023-08-18T11:22:59.128308091Z stderr F time="2023-08-18T11:22:59Z" level=info msg="prometheus HTTP server started"
2023-08-18T11:22:59.128910598Z stderr F time="2023-08-18T11:22:59Z" level=info msg="Starting Kube-vip Manager with the ARP engine"
2023-08-18T11:22:59.128924275Z stderr F time="2023-08-18T11:22:59Z" level=info msg="beginning watching services, leaderelection will happen for every service"
2023-08-18T11:22:59.128927705Z stderr F time="2023-08-18T11:22:59Z" level=info msg="starting services watcher for all namespaces"
2023-08-18T11:22:59.140829982Z stderr F time="2023-08-18T11:22:59Z" level=info msg="[endpoint] watching for service [envoy] in namespace [eksa-system]"
2023-08-18T11:25:30.62989823Z stderr F E0818 11:25:30.616435       1 retrywatcher.go:130] "Watch failed" err="Get \"https://10.96.0.1:443/api/v1/services?watch=true\": dial tcp 10.96.0.1:443: connect: connection refused"
2023-08-18T11:25:30.629933245Z stderr F E0818 11:25:30.616591       1 retrywatcher.go:130] "Watch failed" err="Get \"https://10.96.0.1:443/api/v1/namespaces/eksa-system/endpoints?fieldSelector=metadata.name%3Denvoy&watch=true\": dial tcp 10.96.0.1:443: connect: connection refused"

==> kube-vip-bottle-upgrade-9fh92_kube-system_kube-vip-d1a5ea0bb61f7b8a7928745c733b2940622bb8e8592791191aa4fc91aa8905f8.log <==
2023-08-18T11:39:15.275140488Z stderr F time="2023-08-18T11:39:15Z" level=info msg="Starting kube-vip.io []"
2023-08-18T11:39:15.275164946Z stderr F time="2023-08-18T11:39:15Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[true], Services:[false]"
2023-08-18T11:39:15.275174573Z stderr F time="2023-08-18T11:39:15Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
2023-08-18T11:39:15.275360542Z stderr F time="2023-08-18T11:39:15Z" level=info msg="prometheus HTTP server started"
2023-08-18T11:39:15.277148184Z stderr F time="2023-08-18T11:39:15Z" level=fatal msg="listen:listen tcp :2112: bind: address already in use\n"
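To confirm the conflict on the node, one can check which process is listening on port 2112. This is a generic sketch; the availability of ss or netstat in the Bottlerocket admin container is an assumption.

```shell
# Diagnostic sketch: list the listener on the kube-vip metrics port.
PORT=2112
if command -v ss >/dev/null 2>&1; then
  ss -ltnp "( sport = :$PORT )" || true
elif command -v netstat >/dev/null 2>&1; then
  netstat -ltnp 2>/dev/null | grep ":$PORT " || true
fi
echo "checked port $PORT"
```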

Environment:

  • EKS Anywhere Release: 0.16.2
  • EKS Distro Release:
@max-boehm
Author

Interestingly, the problem does not occur in tests where the physical setup (Bare Metal machines, dedicated L2 network) is simulated in a virtualization environment such as KVM or VMware.

@vivek-koppuru vivek-koppuru added the external An issue, bug or feature request filed from outside the AWS org label Aug 28, 2023
@vivek-koppuru vivek-koppuru added this to the oncall milestone Aug 28, 2023
@vivek-koppuru vivek-koppuru added the area/providers/tinkerbell Tinkerbell provider related tasks and issues label Aug 28, 2023
@max-boehm
Author

I ran into this problem again when upgrading the cluster from EKS Anywhere 0.16.2 to 0.17.3. Here are some more findings:

The static kube-vip-<clustername> pod that advertises the IP address of the cluster endpoint fails to start on the new nodes. The reason is that another kube-vip pod (from the Tinkerbell helm chart deployment) is still running in the cluster and has already bound port 2112. The kube-vip-<clustername> pod also tries to bind that port and fails with the message "tcp port 2112 is already in use".
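The failure mode is the generic TCP rule that only one socket can bind a given address and port. A minimal stand-in for the conflict (python3 invoked from the shell is an assumption; an OS-chosen port is used instead of the real 2112 so the demo is side-effect free):

```shell
# Two binds to the same port fail with EADDRINUSE, just like the
# second kube-vip instance trying to expose metrics on :2112.
RESULT=$(python3 - <<'EOF'
import socket, errno
a = socket.socket()
a.bind(("127.0.0.1", 0))         # first "kube-vip": grabs a free port
a.listen(1)
port = a.getsockname()[1]
b = socket.socket()
try:
    b.bind(("127.0.0.1", port))  # second "kube-vip": same port
except OSError as e:
    print("bind failed:", errno.errorcode[e.errno])
EOF
)
echo "$RESULT"
```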

@max-boehm
Author

Each time an upgrade is started, this problem occurs again. I have not found the root cause, but a workaround is to temporarily move the file /etc/kubernetes/manifests/kube-vip out of its folder and restart containerd, then move it back and restart containerd again. This terminates the process holding port 2112; afterwards a new kube-vip process starts correctly.
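The workaround above, sketched as shell on the control plane node. The manifest path is taken from the comment; the /tmp staging location and the sleep duration are assumptions, and the guard makes this a no-op on a machine without the manifest.

```shell
# Workaround sketch: bounce the static kube-vip pod so the stale
# process holding port 2112 is terminated.
MANIFEST=/etc/kubernetes/manifests/kube-vip
if [ -f "$MANIFEST" ]; then
  mv "$MANIFEST" /tmp/kube-vip   # containerd tears the static pod down
  systemctl restart containerd
  sleep 30                       # assumption: give the pod time to stop
  mv /tmp/kube-vip "$MANIFEST"   # pod is recreated and binds :2112 cleanly
  systemctl restart containerd
else
  echo "no kube-vip manifest found; nothing to do"
fi
```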

@jacobweinstock
Member

Hey @max-boehm, this is fixed in EKS Anywhere > v0.20.

Development

No branches or pull requests

4 participants