
Revisit pod node anti-affinity rules #197

Closed
jmcmeek opened this issue Sep 11, 2019 · 9 comments · Fixed by #206

@jmcmeek

jmcmeek commented Sep 11, 2019

Pod anti-affinity rules were removed in #60 due to Kubernetes scheduler performance concerns.

Scheduler performance enhancements have been made since then, and I think it is safe to use those rules now. See my question and the reply on the #sig-scheduler Slack channel:
Q: https://kubernetes.slack.com/archives/C09TP78DV/p1568127000040100
R: https://kubernetes.slack.com/archives/C09TP78DV/p1568128953041900

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

Also, should this include zone anti-affinity? Something like:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values: ["kube-dns"]
              topologyKey: failure-domain.beta.kubernetes.io/zone
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values: ["kube-dns"]
              topologyKey: "kubernetes.io/hostname"

@chrisohaver
Member

chrisohaver commented Sep 11, 2019

The docs still warn against using them. (I do realize that k8s docs are not always updated proactively)

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity

Note: Inter-pod affinity and anti-affinity require substantial amount of processing which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.

Do you know which PR or commit(s) fixed it? I searched closed k/k PRs, and found one from April 2019, but that doesn't line up with the 1.13 release date (mentioned in the slack channel).

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

Scouring change logs and git history...

Kubernetes change logs suggest 1.11 and 1.12 had work done on anti-affinity scheduler performance - it looks like 1.11 would have been the bigger one.

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#sig-scheduling-1

  • Performance of the anti-affinity predicate of the default scheduler has been improved. (#66948, @mohamed-mehany)

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#sig-scheduling-1

  • Performance of the affinity/anti-affinity predicate for the default scheduler has been significantly improved. (#62211, @bsalamat)

That note was added back in Oct 2017.
https://github.com/kubernetes/website/blame/master/content/en/docs/concepts/configuration/assign-pod-node.md#L182
kubernetes/website#5979

I think that's before the original kube-dns anti-affinity change in Dec 2017:
kubernetes/kubernetes#57683

And the 1.11 change (#62211) dates to Apr 2018 - after anti-affinity had been pulled out.

I'll ask sig-scheduling if that warning is obsolete.

@chrisohaver
Member

Thanks @jmcmeek !

Not sure about the 50 weights ... or how to mix zone and node anti-affinity correctly.
Perhaps using an empty topology key would work just as well, and be shorter...

From https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity

For preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity, empty topologyKey is interpreted as “all topologies” (“all topologies” here is now limited to the combination of kubernetes.io/hostname, failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region).
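A sketch of what that shorter form might look like, assuming an empty topologyKey were accepted (note the PodAffinityTerm API reference quoted later in this thread says it is not allowed, so this is illustrative only):

```yaml
# Hypothetical shorter form: a single empty topologyKey, interpreted as
# "all topologies" (hostname + zone + region) for preferred pod
# anti-affinity, per the docs quoted above.
# NOTE: the v1 API reference states "Empty topologyKey is not allowed",
# so this may be rejected by validation.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100          # weight chosen arbitrarily for illustration
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values: ["kube-dns"]
        topologyKey: ""
```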

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

Checked with sig-scheduling about that warning: at https://kubernetes.slack.com/archives/C09TP78DV/p1568225168053800

Huang-Wei  16 minutes ago
that's a bit obsolete
Huang-Wei  14 minutes ago
but I want to ensure that although we did some improvement, the potential overhead still exists which is inevitable
Huang-Wei  13 minutes ago
in terms of cluster size, in some Github issue, some users mentioned they are using PodAffinity in 5k node cluster

@chrisohaver
Member

chrisohaver commented Sep 11, 2019

OK, thanks - we can add it and see if it passes the 5k-node k8s scale tests. We can add it here now, but as for merging into k8s (e.g. kubeadm/kube-up), 1.16 is already in code freeze, so it won't actually get scale tested at 5k nodes until later in the k8s 1.17 release.

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

@chrisohaver I did limited testing with the snippet I posted. It "worked for me".

failure-domain.beta.kubernetes.io/zone might not always be present - e.g. on a cluster built with kubeadm - so maybe it would be best not to use it.

Note: Pod anti-affinity requires nodes to be consistently labelled, i.e. every node in the cluster must have an appropriate label matching topologyKey. If some or all nodes are missing the specified topologyKey label, it can lead to unintended behavior.
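Dropping the zone term would leave just the node-level spread, which only depends on kubernetes.io/hostname - a label the kubelet sets on every node. A sketch of that reduced form (the weight of 100 here is an arbitrary choice for illustration):

```yaml
# Node-only preferred anti-affinity: spread kube-dns replicas across
# nodes without requiring a zone label to exist on the cluster.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values: ["kube-dns"]
        topologyKey: kubernetes.io/hostname
```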

I'll have to check out the empty topologyKey idea. It sounds promising.

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

On the surface, empty topologyKey does not seem to be allowed.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.15/#podaffinityterm-v1-core

topologyKey string This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.

I found this in the Kubernetes 1.8 changelog, suggesting the note about using an empty topologyKey is outdated:

  • Affinity in annotations alpha feature is no longer supported in 1.8. Anyone upgrading from 1.7 with AffinityInAnnotation feature enabled must ensure pods (specifically with pod anti-affinity PreferredDuringSchedulingIgnoredDuringExecution) with empty TopologyKey fields must be removed before upgrading to 1.8. (#49976, @aveshagarwal)

@jmcmeek
Author

jmcmeek commented Nov 14, 2019

@chrisohaver Thank you!
