DNS not working after reboot #2383

Open
hobyte opened this issue Jul 23, 2021 · 12 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@hobyte

hobyte commented Jul 23, 2021

What happened:
I created a new kind cluster, then rebooted my computer. After the reboot, DNS cannot resolve addresses.

What you expected to happen:

DNS can resolve addresses after the reboot.

How to reproduce it (as minimally and precisely as possible):

  • create a new kind cluster
  • test DNS: it's working
  • reboot your machine (don't stop Docker before the reboot)
  • test DNS again:
#APISERVER=https://kubernetes.default.svc
#SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
#NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
#TOKEN=$(cat ${SERVICEACCOUNT}/token)
#CACERT=${SERVICEACCOUNT}/ca.crt
#curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
curl: (6) Could not resolve host: kubernetes.default.svc

Taken from https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#without-using-a-proxy
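As an alternative quick check, cluster DNS can also be exercised from a throwaway pod (a minimal sketch; the pod name dnstest and the busybox image are illustrative choices, not part of the original report):

kubectl run dnstest --rm -it --restart=Never --image=busybox -- nslookup kubernetes.default.svc.cluster.local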

Anything else we need to know?:

  • DNS pods are running
  • DNS logs:
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae

DNS lookup:

#nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1
#nslookup kubernetes.default.svc
;; connection timed out; no servers could be reached

resolv.conf:

#cat /etc/resolv.conf 
search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.

Environment:

  • kind version: (use kind version): kind v0.11.1 go1.16.4 linux/amd64
  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info): Client:
    Context: default
    Debug Mode: false

Server:
Containers: 5
Running: 2
Paused: 0
Stopped: 3
Images: 11
Server Version: 20.10.6-ce
Storage Driver: btrfs
Build Version: Btrfs v4.15
Library Version: 102
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: oci runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
Default Runtime: runc
Init Binary: docker-init
containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
init version:
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.3.18-59.16-default
Operating System: openSUSE Leap 15.3
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.552GiB
Name: Proxima-Centauri
ID: M6J5:OLHQ:FXVM:M7WG:2OUA:SKGW:UCF5:DWJZ:4M7T:YA2W:6FBT:DOLG
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

  • OS (e.g. from /etc/os-release): NAME="openSUSE Leap"
    VERSION="15.3"
    ID="opensuse-leap"
    ID_LIKE="suse opensuse"
    VERSION_ID="15.3"
    PRETTY_NAME="openSUSE Leap 15.3"
    ANSI_COLOR="0;32"
    CPE_NAME="cpe:/o:opensuse:leap:15.3"
    BUG_REPORT_URL="https://bugs.opensuse.org"
    HOME_URL="https://www.opensuse.org/"
@hobyte added the kind/bug label on Jul 23, 2021
@aojea
Contributor

aojea commented Jul 27, 2021

I assume this snippet has a copy-paste error; it is missing the last digits of the IP address:

#cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.

Are you using one node or multiple nodes in the cluster?
Clusters with multiple nodes don't handle reboots.
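For comparison, a pod's /etc/resolv.conf in a default kind cluster normally carries the full cluster DNS service IP. A sketch of the expected shape, using the 10.96.0.10 address shown in the nslookup output above (the options line is illustrative and may differ):

search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.0.10
options ndots:5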

@BenTheElder added the triage/needs-information label on Jul 29, 2021
@faiq
Contributor

faiq commented Aug 25, 2021

Hi, I'm also running into this issue, although I'm not sure it is necessarily caused by a restart in my case.

$  kubectl run -it --rm --restart=Never busybox1 --image=busybox sh
If you don't see a command prompt, try pressing enter.
/ # nslookup kubernetes.default
Server:		10.96.0.10
Address:	10.96.0.10:53

** server can't find kubernetes.default: NXDOMAIN

*** Can't find kubernetes.default: No answer

/ # 

Here is what I get when I inspect the kind network:

$ docker network inspect kind
[
    {
        "Name": "kind",
        "Id": "7d815ef0d0c4adc297aa523aa3336ba89bc6d7212373d3098f12169618c16563",
        "Created": "2021-08-24T16:41:41.258730207-07:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                },
                {
                    "Subnet": "fc00:f853:ccd:e793::/64"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "1c47d1b38fe7b0b75e71c21c150aba4d5110ade54d74e2f3db45c5d15d013c59": {
                "Name": "konvoy-capi-bootstrapper-control-plane",
                "EndpointID": "4b176452133a1881380cae8b3fc55963ec0427ee809bc1b678d261f3c1711931",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": "fc00:f853:ccd:e793::2/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.driver.mtu": "1454"
        },
        "Labels": {}
    }
]
$ kind get nodes --name konvoy-capi-bootstrapper
konvoy-capi-bootstrapper-control-plane

Output from ip addr:

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 48:2a:e3:0a:7a:8c brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 30:24:32:43:a0:e9 brd ff:ff:ff:ff:ff:ff
    inet 192.168.42.76/24 brd 192.168.42.255 scope global dynamic noprefixroute wlp2s0
       valid_lft 83634sec preferred_lft 83634sec
    inet6 fe80::c3e2:7427:34c8:c265/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
25: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1400 qdisc noqueue state DOWN group default 
    link/ether 02:42:0c:bc:be:aa brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
28: br-7d815ef0d0c4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue state UP group default 
    link/ether 02:42:08:aa:2f:bb brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-7d815ef0d0c4
       valid_lft forever preferred_lft forever
    inet6 fc00:f853:ccd:e793::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::42:8ff:feaa:2fbb/64 scope link 
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link 
       valid_lft forever preferred_lft forever
30: vethba7cc46@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue master br-7d815ef0d0c4 state UP group default 
    link/ether 82:3a:43:df:a0:c1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::803a:43ff:fedf:a0c1/64 scope link 
       valid_lft forever preferred_lft forever

Finally, logs from a CoreDNS pod:

35365->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:36799->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:55841->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:38716->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:51342->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:46009->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:33070->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:34194->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:56925->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:35681->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:42683->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:40842->172.18.0.1:53: i/o timeout

@AlmogBaku
Member

AlmogBaku commented Nov 6, 2021

Hey, for us the same issue happens after stopping/rebooting Docker.
The same issue keeps reproducing on 2 different hosts (cc @RomansWorks).

Edit: we're running a single-node setup with the following config (copied from the website):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
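For reference, a config like this is typically applied at cluster creation time (a sketch; the file name kind-config.yaml is an assumption):

kind create cluster --config kind-config.yaml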

@BenTheElder
Member

@AlmogBaku I still can't reproduce this in any of our environments. We need to know more about yours.
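For anyone hitting this, the environment details that help with triage can be gathered roughly like this (a sketch; the ./kind-logs output directory is an arbitrary choice):

kind version
docker info
kubectl -n kube-system get pods -o wide
kind export logs ./kind-logs    # collects node and cluster logs to attach to the issue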

@AlmogBaku
Member

It usually happens after I've stopped Docker a few times.

Both @RomansWorks and I are using macOS.

@alexandresgf

alexandresgf commented Dec 8, 2021

I have the same issue in my dev environment. The weird thing is that when I connect to the pod with bash and try nslookup, DNS works, as you can see in the screenshot below:

[screenshot: nslookup from the pod shell resolving the name successfully]

But when my application tries the lookup, the name cannot be resolved and nothing works, and no error is returned either (which is also weird):

[screenshot: the application failing to resolve the name]

However, if I use the pod IP directly, it works normally:

[screenshot: the request to the pod IP succeeding]
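One way to narrow down a shell-vs-application difference like this is to compare a direct DNS query with a lookup that goes through the libc/NSS path, since nslookup talks to the DNS server directly while most applications resolve through the C library. A minimal sketch, assuming getent is available in the pod image and using the kubernetes service name as an example:

cat /etc/resolv.conf                                  # which resolver and search domains the pod sees
nslookup kubernetes.default.svc.cluster.local         # queries the DNS server directly
getent hosts kubernetes.default.svc.cluster.local     # resolves through libc/NSS, like most applications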

My stack is:

  • Docker 20.10.11
  • K8s 1.21.1 (kindest/node default, but I already tested all other supported versions)
  • Kind 0.11.1 (single cluster)


@aojea
Contributor

aojea commented Dec 8, 2021

@alexandresgf please don't use screenshots; they are hard to read.

Is this problem happening after a reboot, or did it never work?

@alexandresgf

alexandresgf commented Dec 10, 2021

> @alexandresgf please don't use screenshots; they are hard to read.

Sorry for that!

> Is this problem happening after a reboot, or did it never work?

At first it worked for a while; then it suddenly happened after a reboot, and DNS never worked again, even after removing kind completely and doing a fresh install.

@brpaz

brpaz commented Oct 17, 2022

I have a similar problem. I created a local kind cluster and it worked fine over the entire weekend, but today, when I rebooted my PC, DNS was completely down. I tried restarting Docker, and even restarting the CoreDNS container manually, but that doesn't fix the issue.

I got errors like this all over my containers:

 dial tcp: lookup notification-controller.flux-system.svc.cluster.local. on 10.96.0.10:53: read udp 10.244.0.3:52830->10.96.0.10:53: read: connection refused"

And it's not only the internal network; even external requests are failing with the same error.

dial tcp: lookup github.com on 10.96.0.10:53: read udp 10.244.0.15:41035->10.96.0.10:53: read: connection refused'

Any ideas?
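Not a confirmed fix, but a few commands that can help narrow down whether the cluster DNS service itself survived the reboot (a sketch; the label, service, and deployment names are the kind/kubeadm defaults):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide   # are the CoreDNS pods running?
kubectl -n kube-system get svc kube-dns                       # does the service behind 10.96.0.10 still exist?
kubectl -n kube-system rollout restart deployment coredns     # restart CoreDNS instead of the raw container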

@ben-foxmoore

I observe the same issues when using KinD in a WSL2/Windows 11 environment. Example logs from the CoreDNS pod:

[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0202 14:14:20.711784       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable
E0202 14:14:22.917864       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable

@aojea
Contributor

aojea commented Feb 2, 2023

pkg/mod/k8s.io/client-go@v0.19.2

This is an old version. Also, WSL2/Windows 11 environments had some known issues. Are you using the latest version?

This bug is starting to become a placeholder. I wonder if we should close it and open more specific bugs; a cluster not working after a reboot on Windows is not the same as with podman, or with lima, ...

@ben-foxmoore

Hi @aojea, which component are you saying is outdated?

I'm using kind 0.17.0 and I created the cluster using the command kind create cluster --image kindest/node:v1.21.14@sha256:9d9eb5fb26b4fbc0c6d95fa8c790414f9750dd583f5d7cee45d92e8c26670aa1 which is listed as a supported image in the 0.17.0 release.

I don't believe any of the WSL2 known issues are related to this; they all seem to be related to Docker Desktop behaviour.
