Conflicting error messages on a non-functioning cluster #557

Open
evilnick opened this issue Jul 17, 2024 · 3 comments

@evilnick
Contributor

Summary

In a multi-node cluster where one of the control-plane nodes has disappeared:

ubuntu@able-antelope:~$ sudo k8s status
Error: The node is not part of a Kubernetes cluster. You can bootstrap a new cluster with:

  sudo k8s bootstrap
ubuntu@able-antelope:~$ sudo k8s bootstrap
Error: The node is already part of a cluster
ubuntu@able-antelope:~$ 

What Should Happen Instead?

The first error is wrong. k8s status should instead report that the cluster is not in a working state, rather than that the node has not been bootstrapped.
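
For illustration, a hedged sketch of what the status call could report in this state; the exact wording below is a suggestion, not current output:

  ubuntu@able-antelope:~$ sudo k8s status
  Error: The node is part of a Kubernetes cluster, but the cluster is currently not
  reachable from this node (the other control-plane nodes may be down or unreachable).
  Inspect the k8sd logs on this node for details.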

Reproduction Steps

  1. Set up a cluster with two or more control-plane nodes
  2. Remove one of the control-plane nodes so that it simply disappears (e.g. destroy its machine without running k8s remove-node)
  3. Run k8s status on one of the nodes that still exists (a sketch of one way to do this follows the list)
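
A hedged sketch of one way to reproduce this on LXD instances. The lxc delete --force step matches the comment further down; the node names, snap channel, and the get-join-token / join-cluster commands are assumptions based on the usual Canonical Kubernetes workflow, not taken from this issue:

  # Form a 3-node control-plane cluster on LXD instances cp1, cp2, cp3 (names are placeholders).
  lxc exec cp1 -- snap install k8s --classic --channel=<channel>   # channel is a placeholder
  lxc exec cp1 -- k8s bootstrap
  lxc exec cp1 -- k8s get-join-token cp2        # prints a join token for cp2
  lxc exec cp2 -- snap install k8s --classic --channel=<channel>
  lxc exec cp2 -- k8s join-cluster <token-from-previous-step>
  # ...repeat the token/install/join steps for cp3...

  # Make one control plane "disappear" without removing it from the cluster.
  lxc delete --force cp3

  # Observe the conflicting errors on a surviving node.
  lxc exec cp1 -- k8s status
  lxc exec cp1 -- k8s bootstrap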

System information

inspection-report-20240717_133154.tar.gz

Can you suggest a fix?

No response

Are you interested in contributing with a fix?

No response

@bschimke95
Contributor

@HomayoonAlimohammadi this should be addressed by #564, right?

@HomayoonAlimohammadi
Contributor

@bschimke95 I think we still had some problems, which Angelos fixed in #599.
Let me try to reproduce this issue.

@HomayoonAlimohammadi
Contributor

HomayoonAlimohammadi commented Aug 22, 2024

Here's what I did:

  • Created a 3-node cluster (all control plane)
  • Killed one of the nodes (lxc delete --force)
  • Ran k8s status on one of the remaining nodes:
    • The first time I got a "deadline exceeded" error
    • The second time I got the status, but the IP of the removed node is still listed and heartbeats to that node keep failing:
Aug 22 07:37:40 cp1 k8s.k8sd[1804]: time="2024-08-22T07:37:40Z" level=error msg="Received error sending heartbeat to cluster member" error="Post \"https://10.97.72.146:6400/core/internal/heartbeat\": Unable to connect to \"10.97.72.146:6400\": dial tcp 10.97.72.146:6400: connect: no route to host" target="10.97.72.146:6400"

I retried this, but this time instead of killing a node I ran k8s remove-node, and everything seems fine: k8s status shows the correct message on all nodes (both existing and removed) and the IP is removed. I even removed two nodes in a 3-control-plane setup and everything still works fine.
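
If you want to check whether a surviving node is in this state, the failing heartbeats above are visible in the k8sd logs. A small sketch, assuming the standard snap service naming (snap.k8s.k8sd); adjust the unit name if yours differs:

  # Look for heartbeat failures toward the vanished node on a surviving control plane.
  sudo journalctl -u snap.k8s.k8sd --since "15 minutes ago" | grep "Received error sending heartbeat"

  # The clean path described above, which leaves k8s status consistent:
  sudo k8s remove-node cp3   # node name is a placeholder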
