
Follow-up edits on the nodelocaldns KEP #2842

Merged
merged 1 commit into kubernetes:master from prameshj:nodelocalcache-kep on Nov 10, 2018

Conversation

prameshj
Contributor

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 24, 2018
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. kind/kep sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 24, 2018
-The caching agent daemonset runs in hostNetwork mode in kube-system namespace with a Priority Class of “system-node-critical”. It listens for dns requests on a dummy interface created on the host. A separate ip address is assigned to this dummy interface, so that requests to kube-dns or any other custom service are not incorrectly intercepted by the caching agent. This ip address is obtained by creating a nodelocaldns service, with no endpoints. Kubelet takes the service name as an argument `--localDNS=<namespace>/<svc name>`, looks up the ip address and populates pods' resolv.conf with this value instead of clusterDNS. Each cluster node will have a dummy interface with this service ip assigned to it. This IP will be handled specially because of the NOTRACK rules described in the section below.
+The caching agent daemonset runs in hostNetwork mode in kube-system namespace with a Priority Class of “system-node-critical”. It listens for dns requests on a dummy interface created on the host. A separate ip address is assigned to this dummy interface, so that requests to kube-dns or any other custom service are not incorrectly intercepted by the caching agent. This ip address is obtained by creating a nodelocaldns service, with same endpoints as clusterDNS ip. Using the same endpoints as clusterDNS helps reduce DNS downtime in case of upgrades/restart. When no other special handling is provided, queries to the nodelocaldns ip will be served by kube-dns/coreDNS pods. Kubelet takes the service name as an argument `--localDNS=<namespace>/<svc name>`, looks up the ip address and populates pods' resolv.conf with this value instead of clusterDNS. Each cluster node will have a dummy interface with this service ip assigned to it.

+Since this approach involves assigning a service IP to a dummy interface, it will not work in ipvs mode of kube-proxy. This is because kube-proxy creates a dummy interface bound to all service IPs in ipvs mode and ipvs rules are added to load-balance between endpoints. The packet seems to get dropped if there are no endpoints. If there are endpoints, adding iptables rules does not bypass the ipvs loadbalancing rules. Instead of using a service ip, a link-local ip can be selected by the user in case of ipvs mode. The --localDNS flag can be used to pass on either an ip address or a namespace/service name to support this. Kubelet will learn about the nodelocaldns ip from this flag in both cases.
Contributor Author

@m1093782566 Any other ideas on what we can do in ipvs mode?

Member

I'd PREFER not to have 2 different mechanisms, if we can avoid it.

Would it be sufficient to simply say "pick a link-local address and pass it to --cluster-dns" ? That doesn't give you the property you described above where it falls back to cluster-scope DNS.

Using a service IP is kind of a hack -- a clever one, but a hack. It assumes things about the implementation of cluster IPs that maybe we're better off not assuming.

Contributor Author

We could do that, but we don't have a good fallback option, as you pointed out.

Contributor Author

Can't think of a less hacky way to get the service IP approach to work. This would be viable if the kube-proxy implementation supported a nodelocal type service.
I will change this to listen on a user-specified link-local ip and reuse the --cluster-dns kubelet flag.
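
For concreteness, a minimal sketch of the per-node plumbing settled on above: a dummy interface carrying a link-local listen address, plus NOTRACK rules so DNS traffic to it skips conntrack. The address 169.254.20.10, the interface name nodelocaldns, and shelling out to `ip`/`iptables` are all illustrative assumptions, not the code proposed in the KEP:

```go
package main

import (
	"log"
	"os/exec"
)

const (
	linkName = "nodelocaldns"  // hypothetical dummy interface name
	listenIP = "169.254.20.10" // example link-local listen address passed via --cluster-dns
)

// run executes a command and aborts on failure; the sketch is not idempotent,
// so re-running it on an already configured node will fail.
func run(args ...string) {
	if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
		log.Fatalf("%v failed: %v\n%s", args, err, out)
	}
}

func main() {
	// Create the dummy interface and bind the listen address to it.
	run("ip", "link", "add", linkName, "type", "dummy")
	run("ip", "addr", "add", listenIP+"/32", "dev", linkName)
	run("ip", "link", "set", linkName, "up")

	// Skip conntrack for DNS traffic to the local cache (UDP shown; TCP is analogous),
	// along the lines of the NOTRACK handling referenced in the KEP.
	for _, chain := range []string{"PREROUTING", "OUTPUT"} {
		run("iptables", "-t", "raw", "-A", chain,
			"-d", listenIP+"/32", "-p", "udp", "--dport", "53", "-j", "NOTRACK")
	}
}
```

With something like this in place, kubelet would write `nameserver 169.254.20.10` into pods' resolv.conf via the flag discussed in this thread.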

@prameshj
Contributor Author

/assign @thockin

@m1093782566

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 25, 2018
@@ -86,45 +86,109 @@ The node's resolv.conf will be used by this local agent for all other cache miss

#### Daemonset, Service, Listen Interface for caching agent

Member

Why does this require a new flag, rather than overloading the old flag? Or if we want a new flag, why is it specific to "local" rather than just --cluster-dns-server? I know it seems like a misnomer, but "cluster" means "cluster data", not "cluster scope", here.

Contributor Author

I had introduced a new flag so that if the service specified does not exist for some reason, kubelet still has a valid ip address to write in the pods' resolv.conf.
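
A hypothetical sketch of how one flag could accept either form mentioned here: a literal IP, or a <namespace>/<service> reference whose ClusterIP would then be looked up. The names below are illustrative only and are not actual kubelet code:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// localDNSTarget is a hypothetical representation of where pods' resolv.conf
// should point: either a literal IP or a service reference to resolve later.
type localDNSTarget struct {
	IP        net.IP
	Namespace string
	Service   string
}

func parseLocalDNS(value string) (localDNSTarget, error) {
	if ip := net.ParseIP(value); ip != nil {
		return localDNSTarget{IP: ip}, nil
	}
	parts := strings.SplitN(value, "/", 2)
	if len(parts) == 2 && parts[0] != "" && parts[1] != "" {
		// The caller would still need to look up the service's ClusterIP,
		// which is where the "service does not exist" fallback concern arises.
		return localDNSTarget{Namespace: parts[0], Service: parts[1]}, nil
	}
	return localDNSTarget{}, fmt.Errorf("invalid local DNS value %q", value)
}

func main() {
	for _, v := range []string{"169.254.20.10", "kube-system/nodelocaldns", "bogus"} {
		t, err := parseLocalDNS(v)
		fmt.Printf("%-26s -> %+v %v\n", v, t, err)
	}
}
```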

Unbound performs much better when all requests are cache hits.

coreDNS will be the local cache agent in the first release, after considering these reasons:
Member

@johnbelamaric @miekg @fturib

This chooses CoreDNS for reasons other than flat-out performance, but it would be nice to see if CoreDNS per se can get better numbers - I don't see why it shouldn't be able to get much closer (e.g. within 25%) to Unbound for a properly-configured and compiled instance.

@miekg

miekg commented Oct 29, 2018 via email

@prameshj
Contributor Author

prameshj commented Oct 29, 2018

The DNS tests in the kubernetes/perf-tests repo were run against CoreDNS and Unbound. I ran the tests with the changes here: kubernetes/perf-tests@master...prameshj:nodelocaldnstests

Tests were run on a GCE cluster with 3 nodes.
Initially, I set up coredns or unbound to run as a daemonset, with yaml similar to this: https://github.com/prameshj/kubernetes/blob/ec3c50b5610cda2828146f5f1fb8800cc5f8a584/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml.base
The image in this yaml file works.

I changed the config map and image path depending on the program I was testing.

Then ran the tests using:
`./run --params params/nodelocaldns/small.yaml --use-nodelocal-dns --out-dir out/`

I am working on sending pull requests for both the yaml files and startup scripts in kubernetes as well as the custom coreDNS image with support to add the special iptables rules.

@johnbelamaric
Member

johnbelamaric commented Oct 29, 2018 via email

@chrisohaver

/uncc @chrisohaver

@chrisohaver chrisohaver left a comment

/dismiss-review

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 31, 2018
@prameshj prameshj force-pushed the nodelocalcache-kep branch 2 times, most recently from 0e9a067 to 6c5363d on October 31, 2018 at 20:50
@prameshj
Contributor Author

prameshj commented Nov 1, 2018

> #2846 is the node-local service, so it is a bit of a race. But it won't do the NOTRACK business.

Thanks for the link, I included this as an alternative approach in this KEP. Since we will have a nanny process, the NOTRACK rules can be added separately. It will make a difference only in the iptables mode of kube-proxy, but the service approach will work for all modes.



@prameshj
Contributor Author

prameshj commented Nov 1, 2018

@thockin I modified the KEP to use a link-local ip as the listen address. Please take a look.

Member
@thockin thockin left a comment

Mostly LGTM. Confused about the nanny.


A dnscache nanny, similar to the [dnsmasq nanny](https://github.com/kubernetes/dns/tree/master/pkg/dnsmasq) will create the dummy interface and iptables rules. The nanny gets the nodelocal dns ip as a parameter. The Daemonset runs in privileged securityContext since the nanny container needs to create this dummy interface and add iptables rules.
Member

I thought we didn't need a nanny, since we chose CoreDNS? Is this an init container or persistent?

Contributor Author

I will reword this - we have a single Go process that sets up the rules and invokes coredns.Run(). coredns code is imported into vendor/ - kubernetes/dns#270
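
A rough sketch of that single-process shape: set up the node, then hand off to CoreDNS. The setup is stubbed out, and the hand-off is shown as an exec of a `coredns` binary with an assumed Corefile path purely to keep the example self-contained; the actual change (kubernetes/dns#270) vendors CoreDNS and calls its Run entrypoint in-process instead:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

// setupInterfaceAndRules stands in for the dummy-interface and NOTRACK setup
// shown in the earlier sketch on this PR.
func setupInterfaceAndRules() error {
	// ... create dummy interface, assign the listen IP, add iptables NOTRACK rules ...
	return nil
}

func main() {
	if err := setupInterfaceAndRules(); err != nil {
		log.Fatalf("node-local DNS setup failed: %v", err)
	}

	// Hand off to CoreDNS. The vendored approach would call the CoreDNS Run
	// entrypoint directly instead of exec'ing a separate binary.
	path, err := exec.LookPath("coredns")
	if err != nil {
		log.Fatalf("coredns binary not found: %v", err)
	}
	if err := syscall.Exec(path, []string{"coredns", "-conf", "/etc/coredns/Corefile"}, os.Environ()); err != nil {
		log.Fatalf("exec coredns: %v", err)
	}
}
```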

CoreDNS will be the local cache agent in the first release, after considering these reasons:

* Better QPS numbers for external hostname queries
* Single process, no need for a separate nanny process
Member

Earlier you talked about the nanny?

Populating both the nodelocal cache ip address and kube-dns ip address in resolv.conf is not a reliable option. Depending on the underlying implementation, this can result in kube-dns being queried only if the cache ip does not respond, or in both being queried simultaneously.
Having the pods query the nodelocal cache introduces a single point of failure.

* This is mitigated by having a livenessProbe to periodically ensure DNS is working. In case of upgrades, the recommendation is to drain the node before starting to upgrade the local instance. The user can also configure [customPodDNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config) pointing to clusterDNS ip for pods that cannot handle DNS disruption during upgrade.
Member

Let's make it a pretty aggressive liveness probe.
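
For example, the probe could exec a small checker like the sketch below on a short period. The listen address and the canary name are assumptions for illustration, not values fixed by the KEP:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	const cacheAddr = "169.254.20.10:53" // example link-local listen address

	// Resolve through the node-local cache only, with a tight timeout.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 500 * time.Millisecond}
			return d.DialContext(ctx, "udp", cacheAddr)
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Any in-cluster name works as a canary; a lookup failure marks the cache unhealthy.
	if _, err := resolver.LookupHost(ctx, "kubernetes.default.svc.cluster.local."); err != nil {
		fmt.Fprintln(os.Stderr, "nodelocaldns health check failed:", err)
		os.Exit(1)
	}
}
```

Wired up as an exec livenessProbe with small periodSeconds and failureThreshold values, this keeps detection of a wedged local cache fast.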

@thockin
Member

thockin commented Nov 10, 2018 via email

Ahh, that makes sense. Fix that then LGTM.

@prameshj
Contributor Author


fixed it

@thockin
Member

thockin commented Nov 10, 2018

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 10, 2018
@k8s-ci-robot k8s-ci-robot merged commit cbd0aef into kubernetes:master Nov 10, 2018
justaugustus pushed a commit to justaugustus/community that referenced this pull request Dec 1, 2018