Add AntiAffinity to kube-dns. #2693 #2705
Conversation
Hi @jamesbucher. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `@k8s-bot ok to test`. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 7d4e901 to e2f0267.
I have verified that a clean deployment works for both Kubernetes 1.6 and 1.5 by deploying to AWS and checking kube-scheduler.log for errors on the master node. I also made a few typo fixes and rebased everything into a single commit.
@k8s-bot ok to test
How have you tested? Which k8s versions?
I tested for Kubernetes 1.6.0 by doing the following with a custom build of kops (with the above changes):
I am not sure of a better way to test this; the scheduler behavior is largely out of my hands, but it appears to be working correctly. There is at least API validation on 1.6.2, since it checks the YAML's validity. I attempted to test this on Kubernetes 1.5.3, but the annotation was not present on the deployment. I checked the pre-1.6 YAML file in S3 and it was there... I don't really understand why it's not there in the actual deployment. Is this file picked up from somewhere else? The file https://s3-us-west-2.amazonaws.com/jamesbucher-kops/kubernetes.jamesbucher.info/addons/kube-dns.addons.k8s.io/pre-k8s-1.6.yaml contains the annotation correctly, but the actual deployment on 1.5.3 does not. To be clear, I don't think this works for Kubernetes 1.5.*; it does work for 1.6.0. I will do some investigating tomorrow night to see why 1.5.* doesn't seem to be working, and I am open to ideas as to why the 1.5 deployment didn't include the annotation.
We should tag this as important. If the node where the kube-dns pods are running goes down...
Just to throw in my 2 cents: how about using requiredDuringSchedulingIgnoredDuringExecution? For example, I have a cluster on AWS that uses spot instances for the nodes. Yesterday I lost all the nodes during a crazy spot price spike. With only soft anti-affinity rules, the kube-dns pods could easily end up on the same node in this situation. Maybe spot instances aren't the best example, but I fear that momentary cluster instability for whatever reason could land the DNS pods together, and then we're back to the single point of failure scenario we're trying to avoid with this change.
@jordanjennings: Yeah, I had the same thought about requiredDuringSchedulingIgnoredDuringExecution, or even requiredDuringSchedulingRequiredDuringExecution. My original thinking was that it would be better to have the scheduler schedule both pods (even if they land on the same node) in case one crashes. However, since I didn't see a scheduler option to do a sweep and move pods once the pressure is resolved, that leaves us exposed to a node going down. I think the spot price hike example is a good one. It would be better to have the scheduler favor anti-affinity over being able to schedule the pod, so that you never end up in a state where DNS is only on one node. The caveat is that a cluster under pressure may end up with a DNS pod that cannot be scheduled. I will change this to either requiredDuringSchedulingIgnoredDuringExecution or requiredDuringSchedulingRequiredDuringExecution. To me requiredDuringSchedulingRequiredDuringExecution makes the most sense, but we may want to wait for a PodDisruptionBudget before allowing eviction. When https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling-for-critical-pods.md comes along, we should probably add that to the DNS pods as well, since having DNS go down is very bad.
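For reference, a minimal sketch of what a hard (required) pod anti-affinity rule could look like on the kube-dns Deployment's pod template in the 1.6+ field form. The `k8s-app: kube-dns` label is the selector conventionally used by the add-on; treat the exact labels and topology key as assumptions rather than the literal change in this PR:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never schedule two pods matching this selector onto the
          # same node (topologyKey = hostname). Label values are assumed.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values: ["kube-dns"]
            topologyKey: kubernetes.io/hostname
```

With a rule like this, a second kube-dns replica stays Pending rather than co-locating with the first.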
@jamesbucher Yes I think
@jordanjennings OK, I will take a look at the commits and docs to see if I can determine whether requiredDuringSchedulingRequiredDuringExecution is implemented yet. If it is, I will use that for Kubernetes 1.6 at least, and for 1.5 if it supports it; otherwise I will fall back to requiredDuringSchedulingIgnoredDuringExecution. First and foremost, though, I need to figure out why my 1.5 cluster didn't seem to have the annotation.
OK, I figured out what was wrong with the annotation in 1.5 (it was on the wrong deployment). I also changed the annotation to requiredDuringSchedulingIgnoredDuringExecution. I have left this as a separate commit to make it easier to see the delta from the previous commit; I will rebase before the final merge if desired. Future work here should probably include zone anti-affinity as well, to ensure pods are spread across zones for added resiliency.
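For pre-1.6 clusters, affinity could only be expressed through the alpha scheduler annotation on the pod template. A hedged sketch of that form (selector values and topology key are assumptions):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Alpha annotation consumed by the pre-1.6 scheduler; the affinity rule
        # is embedded as JSON. Label values and topology key are assumptions.
        scheduler.alpha.kubernetes.io/affinity: |
          {
            "podAntiAffinity": {
              "requiredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": {
                  "matchExpressions": [
                    {"key": "k8s-app", "operator": "In", "values": ["kube-dns"]}
                  ]
                },
                "topologyKey": "kubernetes.io/hostname"
              }]
            }
          }
```

For the zone-spread follow-up mentioned above, the same rule with a zone topology key (e.g. failure-domain.beta.kubernetes.io/zone in clusters of that era) would presumably be the starting point.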
Can we use preferred instead of required? I am worried that we could run into load issues as well. If a lot of nodes are at capacity, we may not be able to schedule; then, if DNS is busy, we may not be able to scale it.
@chrislovecnm these are already marked as critical pods; doesn't that mean that in a high-load situation the scheduler will evict other things to make sure these are running? Or does it just mean that once running they won't get evicted?
It is the rescheduling that I am concerned about. Required forces k8s to never place two DNS pods on the same node; preferred allows it but tries hard not to. For instance, with required, a cluster of one node will never have more than one DNS pod, which is not good.
@chrislovecnm What are the "rules" for which cluster components can run on a master? If kube-dns can run on master then even in a single node scenario there would be resilience to single pod failure. If it's not OK to run dns on master, is it safe to say resilience for dns matters more on a multi-node cluster than a single node cluster? Here's what I assume:
Given those assumptions (even ignoring the last one), it makes more sense to me to require DNS pods to be on separate nodes than it does to allow them to stack on a single node. Thoughts?
@jordanjennings - Under the current placement constraints the DNS pod will NOT run on the master. I think the rescheduler that would force pods to be rescheduled is still somewhat up in the air; there is a Priority/Preemption proposal that fits this need, and reading the comments, #268 subsumes Prioritized Scheduling of System Pods.

@chrislovecnm - I think that required is actually the better choice. Since there is no rescheduling enforcement, if two DNS pods end up on the same node they may not get moved for a long time, which opens the cluster up to having DNS go down with a node. In my mind it breaks down into the following categories:
In my opinion, a cluster at 100% capacity is already likely to fail to schedule DNS pods without some sort of pod priority. Without some measure of priority, the scheduler may decide to schedule a non-DNS pod instead of the second (or even first) DNS pod in the event of capacity exhaustion. I would also like to take the autoscaler into account. Since the autoscaler will grow/shrink a cluster, I think it is preferable to have the DNS pod remain unscheduled, for two reasons:
Priority/Preemption +
Reached out to @justinsb so we can get his input.
Asked in sig-scheduling on Slack, cc @kubernetes/sig-scheduling-misc
What about running kube-dns as a DaemonSet on master nodes? Wouldn't that be simpler?
@kenden DaemonSets are sort of not HA; I think the best pattern is to run this as a Deployment.
FYI @justinsb
What is the question?
@davidopp thanks for chiming in. Is
Thanks @chrislovecnm. I see how kube-dns as a DaemonSet might not be HA enough with only one master, though only one master is not HA anyway. But otherwise? If DaemonSets do not work, it would be nice to ensure kube-dns pods are in different AZs when a cluster spans several AZs.
I think we should merge for kops 1.7. I think required is the correct one because we don't have a rescheduler. The concern over preferred in a small cluster is certainly valid, but I would expect that if we didn't have many nodes we also wouldn't have as much DNS load. And I think required will surface the problem more clearly - the deployment will be at < desired.

I do think that (separately) we should also offer a daemonset option for kube-dns, particularly as I think support for same-node routing is there or almost there. But for a deployment, this feels like the right change until we have a rescheduler.
@chrislovecnm are you therefore OK to merge?
I am extremely nervous about having this as required. Maybe we merge this and add an option to override? Also, what happens to validate cluster? What happens during a rolling update?
Sorry for the delay in responding. Most of the things people have said in this thread are correct, so I'm not sure there's much for me to add.

requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity among DNS pods will keep a DNS pod pending if the only node(s) where it can schedule are already running a DNS pod. So if you have two nodes, and DNS is in a two-pod ReplicaSet, and you scale down to one node, then one DNS pod will stay pending until you scale back up such that it can find a node that is not already running a DNS pod.

preferredDuringSchedulingIgnoredDuringExecution has the problem mentioned earlier that it might allow two DNS replicas to schedule onto the same node, and we don't have any mechanism today that will spread the pods if/when you add more nodes. (The rescheduler will do this, eventually.)

I think the way to look at it, given the constraints of how the system works today, is:

+ if you need multiple DNS pods primarily for availability, then requiredDuringSchedulingIgnoredDuringExecution makes sense, because putting more than one DNS pod on the same node isn't useful
+ if you need multiple DNS pods primarily for performance, then preferredDuringSchedulingIgnoredDuringExecution makes sense, because it will allow the DNS pods to schedule even if they can't be spread across nodes
One other thing to keep in mind in general is that all of the priority functions (including the one that implements preferredDuringScheduling) are mixed together to make the final decision of which node is "best." We currently give every factor the same weight, which means the results, while deterministic, can be hard to predict. For example, we have a priority function that prefers to put a container on a node that already has the packages needed by that container installed; this will tend to operate counter to a desire to spread pods of the same set. I'm definitely open to suggestions for tuning the weights of the default priority functions, but today they're all equal.

(In practice, I think that if you set preferredDuringScheduling pod anti-affinity it's very likely the pods will be spread anyway, since there is also a built-in priority function that spreads pods from the same set, so in essence the spreading is applied twice and the package factor only once.)
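For comparison, here is a sketch of the soft (preferred) variant described above, where the rule carries a weight that is scored alongside the scheduler's other priority functions rather than acting as a hard filter. The labels and the weight value are assumptions for illustration:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: spreading is scored, not enforced, so two kube-dns pods
          # can still land on one node if nothing better is available.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100   # range 1-100; assumed value, just one input to the final score
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values: ["kube-dns"]
              topologyKey: kubernetes.io/hostname
```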
So this is great, but it's clear that there are tradeoffs. In kops we try to run as close as possible to the tested kubernetes configuration, which is here, and DNS in particular has caused us a lot of problems with even small deviations. Can I ask that you PR this change to https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/kubedns-controller.yaml.base first, and then we'll pull it into kops? I'd suggest linking this issue as we've had some great discussion here and we just want to make sure that we can leverage all the e2e testing done!
I will file a PR on https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/kubedns-controller.yaml.base and link to this discussion to try and get this in there.
Removing blocks-next as we recommended that this gets addressed upstream first.
Marking as WIP; please link the upstream issue / PR here.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jamesbucher. Associated issue: 2693. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
If we did have a "DaemonSet on masters" option, wouldn't that negate the kube-dns-autoscaler? What about larger clusters, where you'll never have enough masters to handle the load?
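For concreteness, a hypothetical sketch of the "DaemonSet on masters" idea being debated here. The master label, taint, and image tag are assumptions that vary by installer and version, and as noted above this approach would sidestep the kube-dns-autoscaler entirely:

```yaml
apiVersion: extensions/v1beta1   # DaemonSet API group in the 1.6/1.7 era
kind: DaemonSet
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      nodeSelector:
        kubernetes.io/role: master            # assumed master label
      tolerations:
      - key: node-role.kubernetes.io/master   # assumed master taint
        effect: NoSchedule
      containers:
      - name: kubedns
        image: gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1  # image/tag assumed
```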
Hi, I did as requested and made a pull request against the upstream files to revive interest, but I probably won't be able to respond to change requests in a meaningful way. If @jamesbucher or anyone wants to run with it, please take it.
Should we close this in favor of the upstream PR then?
Automatic merge from submit-queue (batch tested with PRs 53106, 52193, 51250, 52449, 53861). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

kube-dns-anti-affinity: kube-dns never-co-located-in-the-same-node

**What this PR does / why we need it**: This is upstreaming the kubernetes/kops#2705 pull request by @jamesbucher that was originally against [kops](https://github.com/kubernetes/kops). Please see kubernetes/kops#2705 for more details, including a lengthy discussion. Briefly, given the constraints of how the system works today:

+ if you need multiple DNS pods primarily for availability, then requiredDuringSchedulingIgnoredDuringExecution makes sense because putting more than one DNS pod on the same node isn't useful
+ if you need multiple DNS pods primarily for performance, then preferredDuringSchedulingIgnoredDuringExecution makes sense because it will allow the DNS pods to schedule even if they can't be spread across nodes

**Which issue this PR fixes**: fixes kubernetes/kops#2693

**Release note**:
```release-note
Improve resilience by annotating kube-dns addon with podAntiAffinity to prefer scheduling on different nodes.
```
So I got this merged upstream (with some changes). What needs to happen here?
@StevenACoffman if @jamesbucher does not respond, can you rebase on his commit and push a new PR? He will still get credit for the PR, but may not have time to get it in.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with a lifecycle frozen comment. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Looks like the proposal got reverted upstream, as the anti-affinity was causing performance issues, but there is another proposal now that the perf issues have been resolved: kubernetes/kubernetes#57683
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Fixes: #2693
This is not quite ready as I have the following left to do:
+ Testing (waiting on DNS registration stuff so I can use kops from my home machine).
+ Determine whether "preferredDuringSchedulingIgnoredDuringExecution" is the correct setting. Other settings would prevent the pod from scheduling on the same node and, I think, mark the pod as unschedulable. Since DNS is pretty critical to a cluster, it is probably better to have the scheduler attempt to schedule the pod anyway, to be resilient against DNS pod crashes.