
Google Cloud and Fedora CoreOS: custom worker nodes don't join the cluster #942

Closed
meabed opened this issue Feb 19, 2021 · 8 comments

meabed (Author) commented Feb 19, 2021

Description
Custom worker nodes don't join the cluster, and the few worker nodes that do join show localhost as their hostname in kubectl get nodes.

This is a cluster on Google Cloud with Fedora CoreOS Stable.
I have added the custom snippets for workers and controllers from #915, and the SSH setup works fine.

I have tried the same setup with Flatcar Linux and it works as expected.

(screenshot attached)

Steps to Reproduce
Create a Google Cloud cluster as described in the documentation and add a custom worker pool as per the docs, as sketched below.
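
For reference, a rough reproduction sequence; the kubeconfig path below is a placeholder, and the Terraform config itself follows the docs:

terraform init
terraform apply
# list nodes using the generated admin kubeconfig (path is a placeholder)
export KUBECONFIG=/path/to/kubeconfig
kubectl get nodes -o wide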

Expected behavior

Workers should join the cluster and report the correct hostname.

Environment

  • Platform: google-cloud
  • OS: fedora-coreos
  • Release: v1.20.2
  • Terraform: v0.14.7

Possible Solution

Link to a PR or description.

dghubble (Member) commented

I wasn't able to reproduce this, clusters with a worker pool of Fedora CoreOS nodes work as expected.

You'll probably need to clarify what "doesn't join" means. Do you never see them appear, or are they there but NotReady? Check the Kubelet logs and be sure you're using the correct kubeconfig and other settings. If a node is reporting its hostname as localhost, check its hostname: is it still localhost? Did other units fail before the hostname was set, maybe? Otherwise, restart the Kubelet; a kubelet cannot register itself as localhost.
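
A sketch of those checks on an affected worker, assuming SSH access to the node and that the kubelet runs as a systemd unit named kubelet.service:

# on the worker node, over SSH
hostnamectl                                              # does the node still report "localhost"?
systemctl --failed                                       # did any units fail before the hostname was set?
systemctl status kubelet.service                         # is the kubelet running at all?
journalctl -u kubelet.service --no-pager | tail -n 100   # look for registration or TLS bootstrap errors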

meabed (Author) commented Feb 20, 2021

Thank you. They don't show up in kubectl get nodes, and after a while only one more node (aside from the controllers) appears, labeled localhost. Flatcar works fine: I uploaded the image to the bucket, created the compute image, and all is good.

dghubble (Member) commented

Have you discovered anything different about your Fedora CoreOS node setup compared to the suggestions or the example? Or are you just switching to Flatcar Linux anyway, in which case I'll close this out?

meabed (Author) commented Feb 22, 2021

Thanks a lot @dghubble. I will apply the example shortly and update the issue with the result. Is there any particular info you would want to see to help debug this issue?

dghubble (Member) commented

You'd need to provide clear details of why your node didn't join and show an example that is independently reproducible. Otherwise, I've already checked a GCP worker pool setup and found it to work fine. I don't provide debug/support.

meabed (Author) commented Feb 22, 2021

Thank you, I understand.

As you can see, there are 5 worker nodes defined in Terraform, but only one shows up, and it shows as localhost. The exact same setup works fine with Flatcar Linux.
(screenshot attached)

I think there is a bug where those worker nodes cannot join the cluster. These are the logs I got from the worker nodes:

2021-02-22 20:42:37.915 [INFO][47] felix/calc_graph.go 413: Local endpoint updated id=WorkloadEndpoint(node=localhost, orchestrator=k8s, workload=kube-system/coredns-7d8995c4cd-66977, name=eth0)
2021-02-22 20:42:37.916 [INFO][47] felix/conntrack.go 90: Removing conntrack flows ip=10.2.91.3
2021-02-22 20:42:37.919 [INFO][47] felix/table.go 596: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2021-02-22 20:42:37.931 [INFO][47] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-wb9fh", EndpointId:"eth0"} ipVersion=0x4 status="up"
2021-02-22 20:42:37.931 [INFO][47] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-wb9fh", EndpointId:"eth0"} status="up"
2021-02-22 20:42:37.931 [INFO][47] felix/int_dataplane.go 1467: Finished applying updates to dataplane. msecToApply=32.245952
2021-02-22 20:42:37.931 [INFO][47] felix/int_dataplane.go 1325: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"kube-system/coredns-7d8995c4cd-66977" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali805e5da4384" profile_ids:"kns.kube-system" profile_ids:"ksa.kube-system.coredns" ipv4_nets:"10.2.91.10/32" >
2021-02-22 20:42:37.932 [INFO][47] felix/int_dataplane.go 1453: Applying dataplane updates
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 561: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 500: Queueing update of chain. chainName="cali-tw-cali805e5da4384" ipVersion=0x4 table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 934: Invalidating dataplane cache ipVersion=0x4 reason="chain update" table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 500: Queueing update of chain. chainName="cali-fw-cali805e5da4384" ipVersion=0x4 table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 500: Queueing update of chain. chainName="cali-sm-cali805e5da4384" ipVersion=0x4 table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 592: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 1089: Applying /proc/sys configuration to interface. ifaceName="cali805e5da4384"
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 454: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.932 [INFO][47] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.937 [INFO][47] felix/conntrack.go 90: Removing conntrack flows ip=10.2.91.8
2021-02-22 20:42:37.940 [INFO][47] felix/table.go 596: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2021-02-22 20:42:37.950 [INFO][47] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"} ipVersion=0x4 status="up"
2021-02-22 20:42:37.950 [INFO][47] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"} status="up"
2021-02-22 20:42:37.950 [INFO][47] felix/int_dataplane.go 1467: Finished applying updates to dataplane. msecToApply=18.112906000000002
bird: Mesh_10_128_0_2: Socket error: bind: Address not available

(screenshot attached)
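
(The felix/bird lines above come from the calico-node pod; a rough way to pull them, assuming Calico's default k8s-app=calico-node label and container name:)

kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
kubectl -n kube-system logs <calico-node-pod> -c calico-node --tail=50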

I appreciate your support; feel free to close the issue if it's not helpful.

dghubble (Member) commented

A kubelet cannot register itself as localhost; nobody can talk to a remote node via localhost. If a node is reporting its hostname as localhost, check its hostname: is it still localhost? Check dmesg. Did you have other systemd units fail that prevented the hostname from being set? Restart the Kubelet to pick up the current hostname.
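
A minimal sketch of that recovery path on the affected node, assuming a systemd-managed hostname and a kubelet.service unit (the expected name below is a placeholder):

hostnamectl status                              # check the static and transient hostname
sudo dmesg | grep -i hostname                   # any early-boot hints about why it was not set
sudo hostnamectl set-hostname <expected-name>   # placeholder for the node's intended name
sudo systemctl restart kubelet.service          # re-register under the corrected hostname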

meabed (Author) commented Feb 23, 2021

Thank you @dghubble, I will try to rebuild the clusters, restart the kubelet, and check again.
The same setup works on AWS, though, without this worker node issue.
I will close the issue and reopen it after I do some debugging and have more logs.
Thanks again.

meabed closed this as completed Feb 23, 2021