DNS mostly fails inside application pods on brand new cluster #4391

Closed
joelittlejohn opened this issue Feb 6, 2018 · 27 comments
Labels: area/networking, lifecycle/rotten

joelittlejohn commented Feb 6, 2018

Kops version: 1.8.0 (git-5099bc5)
Kubernetes version: v1.8.6 (6260bb08c46c31eea6cb538b34a9ceb3e406689c)
Cloud: AWS

The issue

When I create a brand new cluster, everything appears to be working fine: all the masters and workers are ready and I can deploy application pods. BUT most of the time the pods cannot resolve public DNS names like www.google.com, or even internal names like myservice.default. When I run ping www.google.com, the command either takes a long time (over 10 seconds) and eventually reports that the name could not be resolved, or takes a long time and eventually starts pinging Google. It's as if kube-dns is failing most of the time, but not always.
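
A quick way to see the behaviour from inside the cluster is something like the following (a rough sketch; the dns-test pod name and busybox image are just placeholders):

# Start a throwaway pod and try both an internal and an external lookup
kubectl run dns-test --rm -it --image=busybox --restart=Never -- sh
# then, inside the pod:
nslookup kubernetes.default     # internal name; frequently times out
nslookup www.google.com         # external name; same intermittent failures
cat /etc/resolv.conf            # points at 100.64.0.10, the kube-dns Service IP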

Things I have noticed:

  • Only application pods have this problem; all system pods (on masters or workers) appear to be able to resolve names.
  • All nodes (masters and workers) can resolve names.

Steps used to create the cluster:

  1. Created cluster configuration:
kops create cluster \
  --api-loadbalancer-type internal \
  --associate-public-ip=false \
  --cloud=aws \
  --dns private \
  --image "595879546273/CoreOS-stable-1632.2.1-hvm" \
  --master-count 3 \
  --master-size t2.small \
  --master-zones "us-east-1b,us-east-1c,us-east-1d" \
  --name=stg-us-east-1.k8s.local \
  --network-cidr 10.0.64.0/22 \
  --networking flannel \
  --node-count 5 \
  --node-size t2.small \
  --out . \
  --output json \
  --ssh-public-key ~/.ssh/mykey.pub \
  --state s3://mybucket \
  --target=terraform \
  --topology private \
  --vpc vpc-3153eb2e \
  --zones "us-east-1b,us-east-1c,us-east-1d"
  2. Modified subnets (kops edit cluster) as per https://github.com/kubernetes/kops/blob/master/docs/run_in_existing_vpc.md

  3. Updated cluster config (kops update cluster) and deployed everything (terraform apply).

Cluster manifest:

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  name: stg-us-east-1.k8s.local
spec:
  api:
    loadBalancer:
      type: Internal
  authorization:
    alwaysAllow: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mybucket/stg-us-east-1.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.6
  masterInternalName: api.internal.stg-us-east-1.k8s.local
  masterPublicName: api.stg-us-east-1.k8s.local
  networkCIDR: 10.0.64.0/22
  networkID: vpc-73cfbb0a
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.0.65.0/24
    egress: nat-012ee02a09a7830d2
    id: subnet-de86e6f2
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 10.0.66.0/24
    egress: nat-012ee02a09a7830d2
    id: subnet-5fb5ef17
    name: us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 10.0.67.0/24
    egress: nat-012ee02a09a7830d2
    id: subnet-b13da5eb
    name: us-east-1d
    type: Private
    zone: us-east-1d
  - cidr: 10.0.64.32/27
    id: subnet-d68bebfa
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  - cidr: 10.0.64.96/27
    id: subnet-cbb0ea83
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  - cidr: 10.0.64.160/27
    id: subnet-f23ea6a8
    name: utility-us-east-1d
    type: Utility
    zone: us-east-1d
  topology:
    dns:
      type: Private
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: master-us-east-1b
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1b
  role: Master
  subnets:
  - us-east-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: master-us-east-1c
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1c
  role: Master
  subnets:
  - us-east-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: master-us-east-1d
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1d
  role: Master
  subnets:
  - us-east-1d

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: nodes
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-1b
  - us-east-1c
  - us-east-1d

Contents of resolv.conf on a node:

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known DNS servers.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 10.0.64.2
search ec2.internal

Contents of resolv.conf on a system pod:

nameserver 10.0.64.2
search ec2.internal

Contents of resolv.conf on an application pod:

nameserver 100.64.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

Things I have tried

  1. I've tried destroying the cluster and recreating it; the problem is always present.
  2. I've tried adding options single-request-reopen to resolv.conf on the application pods (sketched below), as discussed in DNS intermittent delays of 5s kubernetes#56903, but this made no difference.
  3. I've tried removing options ndots:5 from the application pod resolv.conf, as described elsewhere, but this also made no difference.
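
For reference, the resolv.conf tweaks above were applied by hand inside running pods, roughly like this (the pod name is a placeholder, and the change is lost whenever the pod restarts):

# Append the glibc resolver option inside a running application pod
kubectl exec -it <application-pod> -- sh -c 'echo "options single-request-reopen" >> /etc/resolv.conf'
# Removing ndots:5 was tested the same way, by rewriting /etc/resolv.conf without that line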

joelittlejohn commented Feb 6, 2018

After further investigation: I have five workers, and only two of them show the DNS problem in the application pods they run. Those two nodes are the ones running the kube-dns pods.
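
The correlation is easy to see just by comparing pod placement, roughly like this (assuming the default k8s-app=kube-dns label on the kube-dns pods):

# Which nodes run kube-dns?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
# Which nodes run the affected application pods?
kubectl get pods -o wide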

@joelittlejohn

Some logs:

kubedns log (same on both kube-dns pods):

I0205 17:16:00.273097       1 dns.go:48] version: 1.14.4-2-g5584e04
I0205 17:16:00.280277       1 server.go:70] Using configuration read from directory: /kube-dns-config with period 10s
I0205 17:16:00.280336       1 server.go:113] FLAG: --alsologtostderr="false"
I0205 17:16:00.280346       1 server.go:113] FLAG: --config-dir="/kube-dns-config"
I0205 17:16:00.280353       1 server.go:113] FLAG: --config-map=""
I0205 17:16:00.280358       1 server.go:113] FLAG: --config-map-namespace="kube-system"
I0205 17:16:00.280363       1 server.go:113] FLAG: --config-period="10s"
I0205 17:16:00.280369       1 server.go:113] FLAG: --dns-bind-address="0.0.0.0"
I0205 17:16:00.280374       1 server.go:113] FLAG: --dns-port="10053"
I0205 17:16:00.280380       1 server.go:113] FLAG: --domain="cluster.local."
I0205 17:16:00.280387       1 server.go:113] FLAG: --federations=""
I0205 17:16:00.280394       1 server.go:113] FLAG: --healthz-port="8081"
I0205 17:16:00.280398       1 server.go:113] FLAG: --initial-sync-timeout="1m0s"
I0205 17:16:00.280403       1 server.go:113] FLAG: --kube-master-url=""
I0205 17:16:00.280409       1 server.go:113] FLAG: --kubecfg-file=""
I0205 17:16:00.280413       1 server.go:113] FLAG: --log-backtrace-at=":0"
I0205 17:16:00.280421       1 server.go:113] FLAG: --log-dir=""
I0205 17:16:00.280426       1 server.go:113] FLAG: --log-flush-frequency="5s"
I0205 17:16:00.280431       1 server.go:113] FLAG: --logtostderr="true"
I0205 17:16:00.280436       1 server.go:113] FLAG: --nameservers=""
I0205 17:16:00.280440       1 server.go:113] FLAG: --stderrthreshold="2"
I0205 17:16:00.280444       1 server.go:113] FLAG: --v="2"
I0205 17:16:00.280449       1 server.go:113] FLAG: --version="false"
I0205 17:16:00.280457       1 server.go:113] FLAG: --vmodule=""
I0205 17:16:00.291904       1 server.go:176] Starting SkyDNS server (0.0.0.0:10053)
I0205 17:16:00.299519       1 server.go:198] Skydns metrics enabled (/metrics:10055)
I0205 17:16:00.299529       1 dns.go:147] Starting endpointsController
I0205 17:16:00.299532       1 dns.go:150] Starting serviceController
I0205 17:16:00.300109       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0205 17:16:00.300120       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0205 17:16:00.799732       1 dns.go:171] Initialized services and endpoints from apiserver
I0205 17:16:00.799852       1 server.go:129] Setting up Healthz Handler (/readiness)
I0205 17:16:00.799874       1 server.go:134] Setting up cache handler (/cache)
I0205 17:16:00.799894       1 server.go:120] Status HTTP port 8081

dnsmasq logs (same on both kube-dns pods):

I0205 17:16:00.182836       1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0205 17:16:00.186789       1 nanny.go:86] Starting dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053]
I0205 17:16:01.053760       1 nanny.go:111] 
W0205 17:16:01.053864       1 nanny.go:112] Got EOF from stdout
I0205 17:16:01.054055       1 nanny.go:108] dnsmasq[8]: started, version 2.78-security-prerelease cachesize 1000
I0205 17:16:01.054137       1 nanny.go:108] dnsmasq[8]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0205 17:16:01.054171       1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in6.arpa 
I0205 17:16:01.054206       1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0205 17:16:01.054233       1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0205 17:16:01.054318       1 nanny.go:108] dnsmasq[8]: reading /etc/resolv.conf
I0205 17:16:01.054351       1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in6.arpa 
I0205 17:16:01.054378       1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0205 17:16:01.054445       1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0205 17:16:01.054476       1 nanny.go:108] dnsmasq[8]: using nameserver 10.0.64.2#53
I0205 17:16:01.054538       1 nanny.go:108] dnsmasq[8]: read /etc/hosts - 7 addresses

sidecar log (same on both kube-dns pods):

ERROR: logging before flag.Parse: I0205 17:16:01.163076       1 main.go:48] Version v1.14.4-2-g5584e04
ERROR: logging before flag.Parse: I0205 17:16:01.163230       1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0205 17:16:01.163269       1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0205 17:16:01.163315       1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}

@joelittlejohn changed the title from "DNS barely works inside pods on brand new cluster" to "DNS mostly fails inside application pods on brand new cluster" Feb 6, 2018
@joelittlejohn

I rebuilt this cluster without specifying the image, so it was built with Debian Jessie instead of CoreOS (k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14 (ami-8ec0e1f4) instead of 595879546273/CoreOS-stable-1632.2.1-hvm (ami-a53335df)), and the problem is gone 🤔

So it seems to be an issue with CoreOS, or with the provisioning code for CoreOS.


trinitronx commented Feb 6, 2018

@joelittlejohn :

I think this has to do with kubernetes/kubernetes#21613

The fix that appears to work for us is to run sudo modprobe br_netfilter on all cluster nodes.

This affected our clusters using the CoreOS AMI as well!

The symptom is that DNS responses come from an unexpected source IP: a lookup sends a packet to the kube-dns Service IP, but the response comes back from the pod IP, and the client drops it because it only knows it is talking to the Service IP, not to the pod.

With dig against the Service IP, it looks like this:

# Symptoms: dig svc-name.svc.cluster.local  returns "reply from unexpected source" error such as:
root@example-pod-64598c547d-z9vb4:/# dig @100.64.0.12 svc-name.svc.cluster.local
;; reply from unexpected source: 100.96.3.3#53, expected 100.64.0.12#53
;; reply from unexpected source: 100.96.3.3#53, expected 100.64.0.12#53
;; reply from unexpected source: 100.96.3.3#53, expected 100.64.0.12#53
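
A rough way to check whether a node is affected, and to apply the workaround by hand (run on each node; this is a generic sketch, not kops-specific):

# Is the module loaded? (no output means it is not)
lsmod | grep br_netfilter
# Is bridged traffic being passed to iptables? (the sysctl only exists once the module is loaded)
sysctl net.bridge.bridge-nf-call-iptables
# Workaround: load the module now ...
sudo modprobe br_netfilter
# ... and make sure bridged packets go through the iptables chains kube-proxy programs
sudo sysctl -w net.bridge.bridge-nf-call-iptables=1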

@trinitronx

Not sure where in the kops provisioning process this should go... but it could probably be solved by dropping a file into /etc/modules-load.d/ to load this kernel module:

echo br_netfilter > /etc/modules-load.d/br_netfilter.conf

If using cloud-init or Ignition, there are equivalent options.

@joelittlejohn

@trinitronx Thanks, loading this module completely fixed the problem! I had found a bunch of potential solutions in the kubernetes issues list, but not this one 😂

So it looks like kops should add echo br_netfilter > /etc/modules-load.d/br_netfilter.conf (or some other way of loading this module) to its provisioning of CoreOS clusters, because right now the CoreOS clusters that kops creates are broken 🤔

@joelittlejohn

I fixed this in my own kops cluster by editing the cluster:

kops edit cluster stg-us-east-1.k8s.local --state s3://mybucket

and adding a hook:

  - manifest: |
      Type=oneshot
      ExecStart=/usr/sbin/modprobe br_netfilter
    name: fix-dns.service
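
After saving the edit, the hook only takes effect once the config is pushed out and the nodes are replaced; for a Terraform-managed cluster like this one, that is roughly:

# This cluster is Terraform-managed, so regenerate and apply the Terraform config
kops update cluster stg-us-east-1.k8s.local --state s3://mybucket --target=terraform --out .
terraform apply
# Then replace the existing nodes so they come up with the new hook unit
kops rolling-update cluster stg-us-east-1.k8s.local --state s3://mybucket --yes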

@chrislovecnm

/assign @KashifSaadat @gambol99

calling the CoreOS gurus :)


gambol99 commented Feb 8, 2018

Yeah .. we've hit this one before; it's an old bug (albeit not in kops, but in a previous installer we used, and it was fixed with the same hack as above) ... The br_netfilter module needs to be enabled so that iptables forces all packets, even those traversing the bridge, through the pre-routing tables .. I'm surprised kube-proxy doesn't try to modprobe this itself, much like here.

We seem to have this enabled already, without explicitly doing the modprobe hack .. but we are using canal, so perhaps the version of flannel, or Calico, is doing it for us. Let me do a quick test with CoreOS-stable-1632.2.1-hvm to rule out the OS version.

core@ip-10-250-29-239 /etc/cni $ sudo lsmod | grep br_netfilter
br_netfilter           24576  0
bridge                151552  1 br_netfilter


gambol99 commented Feb 8, 2018

I noticed this in the logs of CoreOS-stable-1632.2.1-hvm:

Feb 08 11:26:15 ip-10-250-101-49.eu-west-2.compute.internal kernel: bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this

It might be worth raising this on the flannel repo to get an official response.


trinitronx commented Feb 9, 2018

Like @joelittlejohn, we were able to fix this at cluster creation time by adding the following hook via kops edit cluster:

Under spec:

hooks:
- name: fix-dns.service
  roles:
  - Node
  - Master
  before:
  - network-pre.target
  - kubelet.service
  manifest: |
    Type=oneshot
    ExecStart=/usr/sbin/modprobe br_netfilter
    [Unit]
    Wants=network-pre.target
    [Install]
    WantedBy=multi-user.target
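
Once nodes come up with this hook, it can be verified on any node roughly like this (the unit name matches the name: field above):

# kops renders hooks as systemd units, named after the hook's name field
systemctl status fix-dns.service
# and the module should now be loaded
lsmod | grep br_netfilter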

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 14, 2018
@joelittlejohn

/remove-lifecycle stale

Does anyone who has commented here know whether this problem still affects newly built clusters using all the latest versions (kops 1.9, CoreOS, Kubernetes 1.9, flannel)?

I'm loath to just close this because it seems like such a massive bug: "DNS broken on brand new cluster". It's not yet clear to me whether this should be fixed in flannel, in Kubernetes, or in kops.

@k8s-ci-robot removed the lifecycle/stale label May 14, 2018

alvgarvilla commented May 25, 2018

Still seeing the issue, but with Calico instead of flannel...

kops 1.9
kubernetes 1.9.7
CoreOS -> https://coreos.com/dist/aws/aws-stable.json for my region
Calico

Not able to resolve internal DNS entries. I'll try the workaround described above.

@macropin

I can confirm that on a cluster built with Kops 1.9.0, running CoreOS 1745.4.0 (Stable), br_netfilter is not loaded on boot.


RobertDiebels commented Jul 21, 2018

Encountering this issue as well. Tried the 'hook' solution [ https://github.com//issues/4391#issuecomment-364321275 ], though it didn't work.

Verified by:
Deploying a radial/busyboxplus pod onto the node running kube-dns. Pinging a pod on the same node got no response; pinging a pod on a different node got a response. Both pods were addressed by their service names, i.e. service-name.namespace.
I also tested the same resources on minikube, where everything runs on a single node, and had no issues there.

Versioning
Cloud-provider: gce
Networking: kubenet
Kernel Version: 4.4.64+
OS Image: cos-cloud/cos-stable-60-9592-90-0
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.8.7
Kube-Proxy Version: v1.8.7
Operating system: linux
Architecture: amd64

@justinsb

re br_netfilter

  • I have been able to reproduce that force-unloading the module (with modprobe -r br_netfilter) causes DNS resolution to fail
  • I have not been able to reproduce a CoreOS or COS image that didn't have the module loaded
  • I think docker loads the module. I'm not sure why it would not load it; maybe if someone had passed some special flags to docker?
  • I don't think the module would be unloaded. Most likely the machine reboots and docker doesn't add it back, I guess.

So I did add force-loading of the module in #5490. Hopefully that will help with the CoreOS issue.


@RobertDiebels I'm not sure ping works very well between pods anyway. I did try COS on GCE and wasn't able to reproduce a problem doing curl -v -k https://kubernetes, but I only realized now that you were pinging, so I'll have to try that case. Either way, I wouldn't recommend using ping as a health check.


RobertDiebels commented Jul 23, 2018

@justinsb Thanks for the tip. I'll try the same using curl and will report back here today.

EDIT: I just ran my code again and everything seems to be working. This time I waited 10 minutes before doing anything; before, I had waited approx. 5 minutes. So I believe my issue was due to the time it takes to initialize. Disregard my earlier comment.
EDIT2: It turns out it wasn't the initialization time after all; it was due to me running 2 clusters in the same GCE project. Opening a new ticket for that issue.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 21, 2018

marblestation commented Oct 26, 2018

We were experiencing a different symptom, but probably from the same underlying problem reported in this issue (another related one), and we were able to fix it by following exactly what @trinitronx shared. All the details can be found here. It would be nice if kops could take care of this automatically, since it took us a large amount of time and effort to figure out what the problem was (and meanwhile our users were hit by plenty of timeouts).

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 25, 2018

MCLDG-zz commented Dec 14, 2018

Same issue on AWS, Kops v1.10.0. I have tried creating K8s clusters with both Debian Stretch (not the standard Jessie) and Amazon Linux. Both have multiple pods failing due to DNS timeouts. In fact, even one of the DNS pods is failing, though the other is fine:

kube-system kube-dns-5fbcb4d67b-kfccn 0/3 CrashLoopBackOff 53 1h

Errors:


I1214 05:23:24.685877       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
F1214 05:23:25.185865       1 dns.go:209] Timeout waiting for initialization

I've tried the 'hook' solution without success.

UPDATE

Just tried this again after upgrading to Kops beta v1.11 and rebuilding a clean cluster. My cluster is now working without any DNS issues.

curl -Lo kops https://github.com/kubernetes/kops/releases/download/1.11.0-beta.1/kops-linux-amd64
chmod +x ./kops
sudo mv ./kops /usr/local/bin/

@michaelajr

I have been unable to get DNS working in a new cluster using Kops 1.11 beta, stable channel. I've tried kube-dns and core-dns with weave as the overlay. No luck. I've tried the hook fix. No luck.

@MCLDG What AMI, DNS provider, and overlay are you using?

@michaelajr

The hook fix did not work because... um, the module was already loaded, so that makes sense. So maybe I have a different issue. I've been unable to get DNS working reliably. Sometimes it works after a long delay, then fails on the same lookup. Sometimes it times out. Sometimes, if I kill one of the DNS servers (take replicas down to 1), things work perfectly! Then, when I try to reproduce this in a new cluster, taking it down to one pod does NOT work. It's super frustrating not being able to reproduce the issue (or the fix) reliably.
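
For anyone wanting to try the same experiment, scaling kube-dns down to one replica is roughly this (assuming the default kube-dns deployment and labels):

# Run a single kube-dns pod ...
kubectl -n kube-system scale deployment kube-dns --replicas=1
# ... and watch where it lands while re-testing lookups
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
# Note: the DNS autoscaler add-on, if present, may scale it back up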

@MCLDG-zz

@michaelajr, my 'create cluster' command is below: default AMI, built-in K8s DNS, and AWS VPC networking.

kops create cluster \
    --node-count 2 \
    --zones ap-southeast-1a,ap-southeast-1b,ap-southeast-1c \
    --master-zones ap-southeast-1a,ap-southeast-1b,ap-southeast-1c \
    --node-size m5.large \
    --master-size t2.medium \
    --topology private \
    --networking amazon-vpc-routed-eni  \
    ${NAME}

Some of my DNS calls would succeed. The pattern that seemed to work was when pod A called pod B and both were on the same worker node; if the pods were on different nodes the call would fail, though this wasn't consistent.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
