DNS mostly fails inside application pods on brand new cluster #4391
Comments
After further investigation: I have five worker nodes, and only two of them have the DNS problem in the application pods they run. The two nodes that have this problem in their pods are the two nodes that are running the kube-dns pods. |
Some logs: kubedns log (same on both kube-dns pods):
dnsmasq logs (same on both kube-dns pods):
sidecar log (same on both kube-dns pods):
|
I rebuilt this cluster without specifying the image to use, so the cluster was built with Debian Jessie instead of CoreOS, and the DNS problem did not occur. So it seems like it's an issue in CoreOS or the provisioning code for CoreOS. |
I think this has to do with kubernetes/kubernetes#21613. The fix that appears to work for us is to run `modprobe br_netfilter` on the affected nodes. This affected our clusters using the CoreOS AMI as well! The symptom is that DNS responses come back from an unexpected source IP: when a lookup is made, the query is sent to the kube-dns service IP, but the reply arrives from a different source address, so the client rejects it. |
Not sure where in
If using |
@trinitronx Thanks, loading this module completely fixed the problem! I had found a bunch of potential solutions in the kubernetes issues list, but not this one 😂 So it looks like kops for CoreOS should be loading `br_netfilter` by default. |
I fixed this in my own kops cluster by editing the cluster (`kops edit cluster`) and adding a hook that loads the module. |
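The hook itself isn't reproduced above. As a rough sketch only (the unit name is illustrative, not necessarily what the commenter used), a kops cluster-spec hook that loads `br_netfilter` before the kubelet starts looks something like this:

```yaml
# Illustrative kops hook (added under spec: via `kops edit cluster`).
# The unit name is made up for this sketch; the key point is loading
# br_netfilter before kubelet.service starts on each node.
hooks:
- name: modprobe-br-netfilter.service
  before:
  - kubelet.service
  manifest: |
    Type=oneshot
    ExecStart=/sbin/modprobe br_netfilter
```

After a node boots with such a hook, `sudo lsmod | grep br_netfilter` (as shown in a later comment) confirms that the module is loaded.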
/assign @KashifSaadat @gambol99 calling the CoreOS gurus :) |
Yeah .. we've hit this one before, it's an old bug (albeit not in kops but in a previous installer we used, and fixed with the same hack as above) ... The br_netfilter module needs to be enabled so iptables forces all packets, even those traversing the bridge, to go through the pre-routing tables .. I'm surprised the kube-proxy doesn't try and modprobe this itself, much like here. We seem to have this enabled already, without explicitly doing a modprobe hack .. but we are using canal, so perhaps either the version of flannel, or calico, is doing it for us. Let me do a quick test with CoreOS-stable-1632.2.1-hvm to rule out the os version.

core@ip-10-250-29-239 /etc/cni $ sudo lsmod | grep br_netfilter
br_netfilter 24576 0
bridge 151552 1 br_netfilter |
I noticed in the logs of CoreOS-stable-1632.2.1-hvm
It might be worth raising on the flannel repo to get an official response .. |
Like @joelittlejohn, we were able to fix this on cluster create by adding the same kind of hook to the cluster spec, under `hooks`. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale Does anyone who has commented here know if this problem would still affect newly built clusters using all the latest versions (kops 1.9, CoreOS, Kubernetes 1.9, flannel)? I'm loath to just close this because it seems like such a massive bug: "DNS broken on brand new cluster". It's not yet clear to me whether this should be fixed in flannel, in kubernetes, or in kops. |
Still the same issue, but with Calico instead of flannel... kops 1.9. Not able to resolve internal DNS entries. I'll try the workaround described above. |
I can confirm that on a cluster built with Kops 1.9.0, running CoreOS |
Encountering this issue as well. Tried the 'hook' solution [ https://github.com//issues/4391#issuecomment-364321275 ], though it didn't work. Verified by
Versioning |
re br_netfilter
So I did add force-loading of the module in #5490. Hopefully that will help with the CoreOS issue. @RobertDiebels I'm not sure ping works very well anyway between pods. I did try with COS on GCE and wasn't able to reproduce a problem doing |
@justinsb Thanks for the tip. I'll try the same using curl. Will report back here today. EDIT: |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
We were experiencing a different symptom, but probably due to the same root cause reported in this issue (another related one), and we could fix it by following exactly what @trinitronx shared. All the details can be found here. It would be nice if kops could automatically take care of this, since it took us a large amount of time and effort to figure out what the problem was (and meanwhile, our users were affected by plenty of timeouts). |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Same issue on AWS, Kops v1.10.0. I have tried creating K8s clusters with both Debian Stretch (not the standard Jessie) and Amazon Linux. Both have multiple pods failing due to DNS timeouts. In fact, even one of the DNS pods is failing, though the other is fine:
Errors:
I've tried the 'hook' solution without success. UPDATE: Just tried this again after upgrading to Kops beta v1.11 and rebuilding a clean cluster. My cluster is now working without any DNS issues.
|
I have been unable to get DNS working in a new cluster using Kops 1.11 beta, stable channel. I've tried kube-dns and core-dns with weave as the overlay. No luck. I've tried the hook fix. No luck. @MCLDG What AMI, DNS provider, and overlay are you using? |
Hook fix did not work because... um, the module was already loaded, so that makes sense. So maybe I have a different issue. Been unable to get DNS working reliably. Sometimes it will work after a long delay, then fail on the same lookup. Sometimes it times out. Sometimes if I kill one of the DNS servers (take replicas down to 1), things work perfectly! Then when I try to reproduce in a new cluster, taking it down to one pod does NOT work. Super frustrating not being able to reproduce the issue reliably (or reproduce a fix reliably). |
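One debugging aid that can make this kind of intermittent failure easier to pin down (not something from this thread, just a common approach) is to run a throwaway DNS-debug pod and repeat lookups from it, optionally pinning it to specific nodes:

```yaml
# Illustrative debug pod, not part of this issue's cluster.
# busybox:1.28 is used because its nslookup output is usable for this test.
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug
spec:
  restartPolicy: Never
  # nodeName: <some-node>   # optionally pin to a specific node to compare behaviour
  containers:
  - name: dns-debug
    image: busybox:1.28
    command: ["sleep", "3600"]
```

Then something like `kubectl exec dns-debug -- nslookup kubernetes.default` can be run repeatedly to see whether failures follow particular nodes, as the earlier observation about the kube-dns nodes suggests.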
@michaelajr, my 'create cluster' statement looks as follows: default AMI, built-in K8s DNS, and AWS VPC networking.
Some of my DNS calls would succeed. The pattern that seemed to work was where pod A called pod B and both were on the same worker node. If the pods were on different nodes the call would fail, though this wasn't consistent. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Kops version: 1.8.0 (git-5099bc5)
Kubernetes version: v1.8.6 (6260bb08c46c31eea6cb538b34a9ceb3e406689c)
Cloud: AWS
The issue
When I create a brand new cluster, everything appears to be working fine. All the masters and workers are ready and I can deploy application pods. BUT, most of the time the pods cannot resolve public DNS names like www.google.com, or even internal names like `myservice.default`. When I run `ping www.google.com`, either the command takes a long time (over 10 seconds) and eventually says the name could not be resolved, or it takes a long time and eventually starts pinging Google. It's as if kube-dns is failing most of the time, but not always.

Things I have noticed:
Steps used to create the cluster:

Modified subnets (`kops edit cluster`) as per https://github.com/kubernetes/kops/blob/master/docs/run_in_existing_vpc.md
Updated cluster config (`kops update cluster`) and deployed everything (`terraform apply`).

Cluster manifest:
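The full manifest isn't reproduced above. Purely for orientation, a heavily trimmed kops Cluster manifest for this kind of setup (AWS, existing VPC, flannel) might look roughly like the sketch below; every value is an illustrative placeholder, not the reporter's actual configuration:

```yaml
# Illustrative placeholder manifest only -- not the reporter's real cluster.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  name: example.k8s.local          # placeholder name
spec:
  cloudProvider: aws
  kubernetesVersion: 1.8.6
  networkID: vpc-xxxxxxxx          # existing VPC, per run_in_existing_vpc.md
  networking:
    flannel: {}
  subnets:
  - name: us-east-1a               # placeholder subnet in the existing VPC
    type: Private
    zone: us-east-1a
```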
Contents of resolv.conf on a node:
Contents of resolv.conf on a system pod:
Contents of resolv.conf on an application pod:
Things I have tried

Adding `options single-request-reopen` to resolv.conf on the application pods, as discussed in "DNS intermittent delays of 5s" (kubernetes#56903), but this made no difference.
Removing `options ndots:5` from the application pod resolv.conf, as described in other places, but this made no difference.
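For reference (not something tried in this thread), on Kubernetes versions newer than the 1.8.6 used here, resolver options like these can be set declaratively through the Pod `dnsConfig` field instead of editing resolv.conf by hand. A minimal sketch, with illustrative values:

```yaml
# Illustrative only: appends resolver options via the Pod API rather than
# editing /etc/resolv.conf inside the container. Requires a Kubernetes
# version with Pod dnsConfig support (newer than the 1.8.6 in this issue).
apiVersion: v1
kind: Pod
metadata:
  name: dns-options-example
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: single-request-reopen
    - name: ndots
      value: "1"        # illustrative; lowers the default ndots of 5
```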
from application pod resolv.conf, as described in other places but this made no difference.The text was updated successfully, but these errors were encountered: