
Google Cloud and Fedora CoreOS: custom worker nodes don't join the cluster #942

Closed
meabed opened this issue Feb 19, 2021 · 8 comments

meabed (Author) commented Feb 19, 2021

Description
Custom worker nodes don't join the cluster, and the few worker nodes that do join show localhost as their hostname in kubectl get nodes.

This is a cluster on Google Cloud with Fedora CoreOS Stable.
I have added the custom snippets for workers and controllers from #915, and the SSH setup works fine.

I have tried the same setup with Flatcar Linux and it works as expected.

(screenshot attached)

Steps to Reproduce
Create a Google Cloud cluster as described in the documentation and add a custom worker pool as per the docs, as sketched below.
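
For reference, a rough reproduction sequence; the kubeconfig path below is a placeholder, and the Terraform config itself follows the docs:

terraform init
terraform apply
# list nodes using the generated admin kubeconfig (path is a placeholder)
export KUBECONFIG=/path/to/kubeconfig
kubectl get nodes -o wide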

Expected behavior

Workers should join the cluster and report the correct hostname.

Environment

  • Platform: google-cloud
  • OS: fedora-coreos
  • Release: v1.20.2
  • Terraform: v0.14.7

Possible Solution

Link to a PR or description.

dghubble (Member) commented

I wasn't able to reproduce this, clusters with a worker pool of Fedora CoreOS nodes work as expected.

You'll probably need to clarify what "doesn't join" means. Do you never see them appear, or are they there but NotReady? Check the Kubelet logs and be sure you're using the correct kubeconfig and other settings. If a node is reporting its hostname as localhost, check its hostname: is it still localhost? Did other units fail before the hostname was set, maybe? Otherwise, restart the Kubelet; a kubelet cannot register itself as localhost.
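
A sketch of those checks on an affected worker, assuming SSH access to the node and that the kubelet runs as a systemd unit named kubelet.service:

# on the worker node, over SSH
hostnamectl                                              # does the node still report "localhost"?
systemctl --failed                                       # did any units fail before the hostname was set?
systemctl status kubelet.service                         # is the kubelet running at all?
journalctl -u kubelet.service --no-pager | tail -n 100   # look for registration or TLS bootstrap errors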

meabed (Author) commented Feb 20, 2021

Thank you. They don't show up in kubectl get nodes, and after a while only one more node (aside from the controllers) appears, labeled localhost. Flatcar works fine: I uploaded the image to the bucket, created the compute image, and all is good.

dghubble (Member) commented

Have you discovered anything different about your Fedora CoreOS node setup compared to the suggestions or the example? Or are you just switching to Flatcar Linux anyway, in which case I'll close this out?

meabed (Author) commented Feb 22, 2021

Thanks a lot @dghubble. I will apply the example shortly and update the issue with the result. Is there any particular info you would want to see to help debug this issue?

dghubble (Member) commented

You'd need to provide clear details of why your node didn't join and show an example that is independently reproducible. Otherwise, I've already checked a GCP worker pool setup and found it to work fine. I don't provide debug/support.

meabed (Author) commented Feb 22, 2021

Thank you, I understand.

As you can see, there are 5 worker nodes defined in Terraform, but only one shows up, and it shows as localhost. The exact same setup works fine with Flatcar Linux.
(screenshot attached)

I think there is a bug where those worker nodes cannot join the cluster. These are the logs I got from the worker nodes:

2021-02-22 20:42:37.915 [INFO][47] felix/calc_graph.go 413: Local endpoint updated id=WorkloadEndpoint(node=localhost, orchestrator=k8s, workload=kube-system/coredns-7d8995c4cd-66977, name=eth0)
2021-02-22 20:42:37.916 [INFO][47] felix/conntrack.go 90: Removing conntrack flows ip=10.2.91.3
2021-02-22 20:42:37.919 [INFO][47] felix/table.go 596: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2021-02-22 20:42:37.931 [INFO][47] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-wb9fh", EndpointId:"eth0"} ipVersion=0x4 status="up"
2021-02-22 20:42:37.931 [INFO][47] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-wb9fh", EndpointId:"eth0"} status="up"
2021-02-22 20:42:37.931 [INFO][47] felix/int_dataplane.go 1467: Finished applying updates to dataplane. msecToApply=32.245952
2021-02-22 20:42:37.931 [INFO][47] felix/int_dataplane.go 1325: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"kube-system/coredns-7d8995c4cd-66977" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali805e5da4384" profile_ids:"kns.kube-system" profile_ids:"ksa.kube-system.coredns" ipv4_nets:"10.2.91.10/32" >
2021-02-22 20:42:37.932 [INFO][47] felix/int_dataplane.go 1453: Applying dataplane updates
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 561: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 500: Queueing update of chain. chainName="cali-tw-cali805e5da4384" ipVersion=0x4 table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 934: Invalidating dataplane cache ipVersion=0x4 reason="chain update" table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 500: Queueing update of chain. chainName="cali-fw-cali805e5da4384" ipVersion=0x4 table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/table.go 500: Queueing update of chain. chainName="cali-sm-cali805e5da4384" ipVersion=0x4 table="filter"
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 592: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 1089: Applying /proc/sys configuration to interface. ifaceName="cali805e5da4384"
2021-02-22 20:42:37.932 [INFO][47] felix/endpoint_mgr.go 454: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.932 [INFO][47] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"}
2021-02-22 20:42:37.937 [INFO][47] felix/conntrack.go 90: Removing conntrack flows ip=10.2.91.8
2021-02-22 20:42:37.940 [INFO][47] felix/table.go 596: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2021-02-22 20:42:37.950 [INFO][47] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"} ipVersion=0x4 status="up"
2021-02-22 20:42:37.950 [INFO][47] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/coredns-7d8995c4cd-66977", EndpointId:"eth0"} status="up"
2021-02-22 20:42:37.950 [INFO][47] felix/int_dataplane.go 1467: Finished applying updates to dataplane. msecToApply=18.112906000000002
bird: Mesh_10_128_0_2: Socket error: bind: Address not available

(screenshot attached)
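
(The felix/bird lines above come from the calico-node pod; a rough way to pull them, assuming Calico's default k8s-app=calico-node label and container name:)

kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
kubectl -n kube-system logs <calico-node-pod> -c calico-node --tail=50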

I appreciate your support; feel free to close the issue if it's not helpful.

dghubble (Member) commented

A kubelet cannot register itself as localhost; nobody can talk to a remote node via localhost. If a node is reporting its hostname as localhost, check its hostname: is it still localhost? Check dmesg. Did you have other systemd units fail that prevented the hostname from being set? Restart the Kubelet to pick up the current hostname.
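
A minimal sketch of that recovery path on the affected node, assuming a systemd-managed hostname and a kubelet.service unit (the expected name below is a placeholder):

hostnamectl status                              # check the static and transient hostname
sudo dmesg | grep -i hostname                   # any early-boot hints about why it was not set
sudo hostnamectl set-hostname <expected-name>   # placeholder for the node's intended name
sudo systemctl restart kubelet.service          # re-register under the corrected hostname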

meabed (Author) commented Feb 23, 2021

Thank you @dghubble, I will try to rebuild the clusters, restart the kubelet, and check again.
The same setup works on AWS, though, without this worker node issue.
I will close the issue and reopen it after I do some debugging and have more logs.
Thanks again.

meabed closed this as completed Feb 23, 2021