Follow up edits on the nodelocaldns KEP #2842
Conversation
- The caching agent daemonset runs in hostNetwork mode in kube-system namespace with a Priority Class of “system-node-critical”. It listens for dns requests on a dummy interface created on the host. A separate ip address is assigned to this dummy interface, so that requests to kube-dns or any other custom service are not incorrectly intercepted by the caching agent. This ip address is obtained by creating a nodelocaldns service, with no endpoints. Kubelet takes the service name as an argument `--localDNS=<namespace>/<svc name>`, looks up the ip address and populates pods' resolv.conf with this value instead of clusterDNS. Each cluster node will have a dummy interface with this service ip assigned to it. This IP will be handled specially because of the NOTRACK rules described in the section below.
+ The caching agent daemonset runs in hostNetwork mode in kube-system namespace with a Priority Class of “system-node-critical”. It listens for dns requests on a dummy interface created on the host. A separate ip address is assigned to this dummy interface, so that requests to kube-dns or any other custom service are not incorrectly intercepted by the caching agent. This ip address is obtained by creating a nodelocaldns service, with the same endpoints as the clusterDNS ip. Using the same endpoints as clusterDNS helps reduce DNS downtime in case of upgrades/restart. When no other special handling is provided, queries to the nodelocaldns ip will be served by kube-dns/coreDNS pods. Kubelet takes the service name as an argument `--localDNS=<namespace>/<svc name>`, looks up the ip address and populates pods' resolv.conf with this value instead of clusterDNS. Each cluster node will have a dummy interface with this service ip assigned to it.
+ Since this approach involves assigning a service IP to a dummy interface, it will not work in ipvs mode of kube-proxy. This is because kube-proxy creates a dummy interface bound to all service IPs in ipvs mode and ipvs rules are added to load-balance between endpoints. The packet seems to get dropped if there are no endpoints. If there are endpoints, adding iptables rules does not bypass the ipvs load-balancing rules. Instead of using a service ip, a link-local ip can be selected by the user in case of ipvs mode. The --localDNS flag can be used to pass on either an ip address or a namespace/service name to support this. Kubelet will learn about the nodelocaldns ip from this flag in both cases.
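To make the service-IP mechanism concrete, here is a minimal sketch of the per-node setup it implies, assuming a hypothetical nodelocaldns service in kube-system and a dummy interface named nodelocaldns (the names and resulting IP are illustrative, not values fixed by the KEP):

```sh
# Resolve the cluster IP of the (hypothetical) nodelocaldns service; this is the
# same value kubelet would look up from --localDNS=kube-system/nodelocaldns.
LOCAL_DNS_IP=$(kubectl -n kube-system get svc nodelocaldns -o jsonpath='{.spec.clusterIP}')

# Bind that IP to a dummy interface on the node, so queries to it terminate
# locally at the caching agent instead of being forwarded elsewhere.
ip link add nodelocaldns type dummy
ip addr add "${LOCAL_DNS_IP}/32" dev nodelocaldns
ip link set nodelocaldns up

# Pods' resolv.conf would then carry this IP as their nameserver, e.g.:
#   nameserver <LOCAL_DNS_IP>
```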
@m1093782566 Any other ideas on what we can do in ipvs mode?
I'd PREFER not to have 2 different mechanisms, if we can avoid it.
Would it be sufficient to simply say "pick a link-local address and pass it to --cluster-dns"? That doesn't give you the property you described above where it falls back to cluster-scope DNS.
Using a service IP is kind of a hack -- a clever one, but a hack. It assumes things about the implementation of cluster IPs that maybe we're better off not assuming.
We could do that, but we don't have a good fallback option, like you pointed out.
Can't think of a less hacky way to get the service IP approach to work. This would be viable if the kube-proxy implementation supported a nodelocal type service.
I will change this to listen on a user-specified link-local ip and reuse the --cluster-dns kubelet flag.
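To illustrate the revised approach, a rough sketch of the node setup with a user-chosen link-local address (169.254.20.10 is only an example value picked for the sketch, not mandated by the KEP):

```sh
# Operator-selected link-local address for the node-local cache (example value).
LOCAL_DNS_IP=169.254.20.10

# Same dummy-interface trick as before, but no service IP is involved.
ip link add nodelocaldns type dummy
ip addr add "${LOCAL_DNS_IP}/32" dev nodelocaldns
ip link set nodelocaldns up

# Kubelet is pointed at this address through the existing flag, so pods'
# resolv.conf picks it up:
#   kubelet --cluster-dns=169.254.20.10 <other flags>
```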
/assign @thockin
/ok-to-test
@@ -86,45 +86,109 @@ The node's resolv.conf will be used by this local agent for all other cache miss

#### Daemonset, Service, Listen Interface for caching agent
Why does this require a new flag, rather than overloading the old flag? Or if we want a new flag, why is it specific to "local" rather than just --cluster-dns-server? I know it seems like a misnomer, but "cluster" means "cluster data", not "cluster scope", here.
I had introduced a new flag so that if the service specified does not exist for some reason, kubelet still has a valid ip address to write in the pods' resolv.conf.
Unbound performs much better when all requests are cache hits.

coreDNS will be the local cache agent in the first release, after considering these reasons:
This chooses CoreDNS for reasons other than flat-out performance, but it would be nice to see if CoreDNS can get better numbers - I don't see why it shouldn't be able to get much closer (e.g. within 25%) to Unbound for a properly-configured and compiled instance.
/cc @chrisohaver
@johnbelamaric @miekg @fturib
Are there any details on how this was carried out?
The dns tests in the kubernetes/perf-tests repo were run against coredns and unbound. I ran the tests with the changes here: kubernetes/perf-tests@master...prameshj:nodelocaldnstests
Tests were run on a GCE cluster with 3 nodes. I changed the config map and image path depending on the program I was testing, then ran the tests using:
I am working on sending pull requests for both the yaml files and startup scripts in kubernetes, as well as the custom coreDNS image with support to add the special iptables rules.
#2846 is the node-local service, so it is a bit of a race. But it won't do the NOTRACK business.
/uncc @chrisohaver
/dismiss-review
Thanks for the link, I included this as an alternate approach in this KEP. Since we will have a nanny process, the NOTRACK rules can be added separately. They will make a difference only in iptables mode of kube-proxy, but the service approach will work for all modes.
@thockin I modified the KEP to use a link-local ip as the listen address. Please take a look.
mostly LGTM. Confused about nanny
A dnscache nanny, similar to the [dnsmasq nanny](https://github.com/kubernetes/dns/tree/master/pkg/dnsmasq), will create the dummy interface and iptables rules. The nanny gets the nodelocal dns ip as a parameter. The Daemonset runs in a privileged securityContext since the nanny container needs to create this dummy interface and add iptables rules.
I thought we didn't need a nanny, since we chose CoreDNS? Is this an init container or persistent?
I will reword this - we have a single Go process that sets up the rules and invokes coredns.Run(). coredns code is imported into vendor/ - kubernetes/dns#270
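For illustration, a rough sketch of the NOTRACK rules such a setup process might install for the node-local address, in addition to creating the dummy interface sketched earlier; the address and the exact rule set are assumptions for the example, not text from the KEP:

```sh
LOCAL_DNS_IP=169.254.20.10  # example value

# Skip connection tracking for DNS traffic to and from the node-local address,
# so these packets bypass conntrack and kube-proxy's NAT rules.
for proto in udp tcp; do
  iptables -t raw -A PREROUTING -d "${LOCAL_DNS_IP}" -p "${proto}" --dport 53 -j NOTRACK
  iptables -t raw -A OUTPUT     -s "${LOCAL_DNS_IP}" -p "${proto}" --sport 53 -j NOTRACK
done
```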
CoreDNS will be the local cache agent in the first release, after considering these reasons:

* Better QPS numbers for external hostname queries
* Single process, no need for a separate nanny process
Earlier you talked about the nanny?
Populating both the nodelocal cache ip address and the kube-dns ip address in resolv.conf is not a reliable option. Depending on the underlying implementation, this can result in kube-dns being queried only if the cache ip does not respond, or both being queried simultaneously.

Having the pods query the nodelocal cache introduces a single point of failure.

* This is mitigated by having a livenessProbe to periodically ensure DNS is working. In case of upgrades, the recommendation is to drain the node before starting to upgrade the local instance. The user can also configure [customPodDNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config) pointing to the clusterDNS ip for pods that cannot handle DNS disruption during upgrade.
let's make it a pretty aggressive liveness probe
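As one illustration, an exec-style check the probe could run against the node-local address (the command, timeout, and address are assumptions for this sketch, not something the KEP prescribes):

```sh
# Fail quickly if the node-local cache stops answering on its address (example IP).
dig +time=1 +tries=1 @169.254.20.10 kubernetes.default.svc.cluster.local. +short || exit 1
```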
Ahh, that makes sense. Fix that then LGTM
fixed it
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@thockin @bowei @johnbelamaric