Conflicting error messages on a non-functioning cluster #557

Open
evilnick opened this issue Jul 17, 2024 · 3 comments

@evilnick
Contributor

Summary

In a multi-node cluster where one of the control-plane nodes has disappeared:

ubuntu@able-antelope:~$ sudo k8s status
Error: The node is not part of a Kubernetes cluster. You can bootstrap a new cluster with:

  sudo k8s bootstrap
ubuntu@able-antelope:~$ sudo k8s bootstrap
Error: The node is already part of a cluster
ubuntu@able-antelope:~$ 

What Should Happen Instead?

The first error is wrong. k8s status should instead report that the cluster is not in a working state, rather than that the node has not been bootstrapped.
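
For illustration, a hedged sketch of what the status call could report in this state; the exact wording below is a suggestion, not current output:

  ubuntu@able-antelope:~$ sudo k8s status
  Error: The node is part of a Kubernetes cluster, but the cluster is currently not
  reachable from this node (the other control-plane nodes may be down or unreachable).
  Inspect the k8sd logs on this node for details.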

Reproduction Steps

  1. Set up a cluster with two or more control-plane nodes
  2. Remove one of the control-plane nodes so that it simply disappears (e.g. destroy its machine without running k8s remove-node)
  3. Run k8s status on one of the nodes that still exists (a sketch of one way to do this follows the list)
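
A hedged sketch of one way to reproduce this on LXD instances. The lxc delete --force step matches the comment further down; the node names, snap channel, and the get-join-token / join-cluster commands are assumptions based on the usual Canonical Kubernetes workflow, not taken from this issue:

  # Form a 3-node control-plane cluster on LXD instances cp1, cp2, cp3 (names are placeholders).
  lxc exec cp1 -- snap install k8s --classic --channel=<channel>   # channel is a placeholder
  lxc exec cp1 -- k8s bootstrap
  lxc exec cp1 -- k8s get-join-token cp2        # prints a join token for cp2
  lxc exec cp2 -- snap install k8s --classic --channel=<channel>
  lxc exec cp2 -- k8s join-cluster <token-from-previous-step>
  # ...repeat the token/install/join steps for cp3...

  # Make one control plane "disappear" without removing it from the cluster.
  lxc delete --force cp3

  # Observe the conflicting errors on a surviving node.
  lxc exec cp1 -- k8s status
  lxc exec cp1 -- k8s bootstrap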

System information

inspection-report-20240717_133154.tar.gz

Can you suggest a fix?

No response

Are you interested in contributing with a fix?

No response

@bschimke95
Contributor

@HomayoonAlimohammadi this should be addressed by #564, right?

@HomayoonAlimohammadi
Contributor

@bschimke95 I think we still had some problems, which Angelos fixed in #599.
Let me try to reproduce this issue.

@HomayoonAlimohammadi
Contributor

HomayoonAlimohammadi commented Aug 22, 2024

Here's what I did:

  • Created a 3-node cluster (all control plane)
  • Killed one of the nodes (lxc delete --force)
  • Ran k8s status on one of the remaining nodes:
    • The first time I got a "deadline exceeded" error
    • The second time I got the status, but the IP of the removed node is still listed and heartbeats to that node keep failing:
Aug 22 07:37:40 cp1 k8s.k8sd[1804]: time="2024-08-22T07:37:40Z" level=error msg="Received error sending heartbeat to cluster member" error="Post \"https://10.97.72.146:6400/core/internal/heartbeat\": Unable to connect to \"10.97.72.146:6400\": dial tcp 10.97.72.146:6400: connect: no route to host" target="10.97.72.146:6400"

I retried this, but this time instead of killing a node I ran k8s remove-node, and everything seems fine: k8s status shows the correct message on all nodes (both existing and removed) and the IP is removed. I even removed two nodes in a 3-control-plane setup and everything still works fine.
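
If you want to check whether a surviving node is in this state, the failing heartbeats above are visible in the k8sd logs. A small sketch, assuming the standard snap service naming (snap.k8s.k8sd); adjust the unit name if yours differs:

  # Look for heartbeat failures toward the vanished node on a surviving control plane.
  sudo journalctl -u snap.k8s.k8sd --since "15 minutes ago" | grep "Received error sending heartbeat"

  # The clean path described above, which leaves k8s status consistent:
  sudo k8s remove-node cp3   # node name is a placeholder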
