
Single-node cluster crashes when the node reboots (container IP change makes kubeconfig out of date) #3071

Closed
panpan0000 opened this issue Jan 17, 2023 · 3 comments
Labels
area/provider/podman Issues or PRs related to podman kind/feature Categorizes issue or PR as related to a new feature. triage/duplicate Indicates an issue is a duplicate of other open issue.

Comments

panpan0000 commented Jan 17, 2023

Impact

For a single-node cluster, when the host OS reboots, the cluster fails to work. For example, the kube-controller-manager complains with logs:

leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: 
Get "https://10.89.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": 
dial tcp 10.89.0.1:6443: connect: no route to host
....(repeating)

NOTE: remember the IP 10.89.0.1 above.

Root Cause

(1) When the host OS reboots, the IP address of eth0 inside the kind container tends to differ from last time:

  • podman runtime: every time the OS reboots or podman restarts, the podman network assigns a "+1" IP to the container. For example, if the IP is 10.89.0.1 the first time, the next time it will be 10.89.0.2, and so on.
  • docker runtime: there is still some chance the kind container is assigned its previous IP; it seems to depend on luck.

(2) When the kind cluster has only one node, the two kubeconfig files below use the node IP instead of localhost:

  • /etc/kubernetes/controller-manager.conf
  • /etc/kubernetes/scheduler.conf

In the above conf files, the content looks like:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: *
    server: https://10.89.0.1:6443           # <-------------- it uses the "container IP" here
  name: **

Workaround

I have to use "sed" to replace the container IP with 127.0.0.1 right after the kind cluster becomes ready, so that the Kubernetes cluster can survive container restarts and OS reboots.
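The sed-based rewrite can be sketched as a small self-contained Go program (the file paths and the 127.0.0.1 target come from this thread; the function name here is illustrative, and on a live cluster you would apply the equivalent sed inside the node container):

```go
package main

import (
	"fmt"
	"regexp"
)

// serverLine matches the kubeconfig "server:" field pointing at any host on port 6443.
var serverLine = regexp.MustCompile(`server: https://\S+:6443`)

// rewriteToLocalhost replaces the container-IP API server address with 127.0.0.1,
// mirroring the sed workaround described in this issue.
func rewriteToLocalhost(conf []byte) []byte {
	return serverLine.ReplaceAll(conf, []byte("server: https://127.0.0.1:6443"))
}

func main() {
	// Demonstration on a sample kubeconfig fragment; on a real single-node
	// cluster the targets would be /etc/kubernetes/controller-manager.conf
	// and /etc/kubernetes/scheduler.conf inside the node container.
	sample := []byte("apiVersion: v1\nclusters:\n- cluster:\n    server: https://10.89.0.1:6443\n  name: kind\n")
	fmt.Print(string(rewriteToLocalhost(sample)))
}
```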

I haven't tested the multi-node situation; in theory the result will be worse and the solution may be more complicated.

Fixes (?)

I propose that when kind detects only one node in the cluster, the code below could use localhost for AdvertiseAddress instead.

c.AdvertiseAddress = strings.Split(c.NodeAddress, ",")[0]
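A hedged sketch of that idea (the function and parameters are illustrative, not kind's actual API): fall back to localhost when the cluster has exactly one node, and keep the current first-address behavior otherwise.

```go
package main

import (
	"fmt"
	"strings"
)

// advertiseAddress is a hypothetical helper: for a single-node cluster it
// advertises localhost so the address survives container-IP changes across
// reboots; otherwise it keeps kind's current behavior of taking the first
// entry of the comma-separated node address list.
func advertiseAddress(nodeAddress string, nodeCount int) string {
	if nodeCount == 1 {
		return "127.0.0.1"
	}
	return strings.Split(nodeAddress, ",")[0]
}

func main() {
	fmt.Println(advertiseAddress("10.89.0.1,fc00:f853:ccd:e793::2", 1)) // prints "127.0.0.1"
	fmt.Println(advertiseAddress("10.89.0.2,10.89.0.3", 2))             // prints "10.89.0.2"
}
```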

But again, the multi-node situation might be harder.

My two cents, just to set the ball rolling...

@panpan0000 panpan0000 added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 17, 2023
panpan0000 commented Jan 17, 2023

Somebody asked whether the kind config below would help:

kind: Cluster 
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  apiServerAddress: "127.0.0.1"

Actually, it won't.

This field only changes the kubeadm certSANs: https://github.com/kubernetes-sigs/kind/blob/main/pkg/cluster/internal/kubeadm/config.go#L193

I tried it; the issue remains.

BenTheElder (Member) commented:

KIND + Podman doesn't support restart. KIND + Docker does.

See: #2272, the existing tracking issue.

The problem with just switching to new IPs is that we have no way to automatically discover the new peer IPs until podman supports embedded DNS equivalent to docker's, or someone proposes an alternate approach.

@BenTheElder BenTheElder added triage/duplicate Indicates an issue is a duplicate of other open issue. area/provider/podman Issues or PRs related to podman labels Jan 17, 2023
panpan0000 (Author) commented:

For a single-node kind cluster, you can resolve this problem with:

podman exec -it "${your_container_name}" bash -c "sed -i 's/server: .*:6443/server: https:\/\/127.0.0.1:6443/g' /etc/kubernetes/controller-manager.conf"
podman exec -it "${your_container_name}" bash -c "sed -i 's/server: .*:6443/server: https:\/\/127.0.0.1:6443/g' /etc/kubernetes/scheduler.conf"
