
Revisit pod node anti-affinity rules #197

Closed
jmcmeek opened this issue Sep 11, 2019 · 9 comments · Fixed by #206

@jmcmeek

jmcmeek commented Sep 11, 2019

Pod anti-affinity rules were removed in #60 due to Kubernetes scheduler performance concerns.

Scheduler performance enhancements have been made since then, and I think it is safe to use those rules now. See my question and the reply on the #sig-scheduler Slack channel:
Q: https://kubernetes.slack.com/archives/C09TP78DV/p1568127000040100
R: https://kubernetes.slack.com/archives/C09TP78DV/p1568128953041900

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

Also, should this include zone anti-affinity? Something like:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values: ["kube-dns"]
              topologyKey: failure-domain.beta.kubernetes.io/zone
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values: ["kube-dns"]
              topologyKey: "kubernetes.io/hostname"

@chrisohaver
Member

chrisohaver commented Sep 11, 2019

The docs still warn against using them. (I do realize that k8s docs are not always updated proactively)

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity

Note: Inter-pod affinity and anti-affinity require substantial amount of processing which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.

Do you know which PR or commit(s) fixed it? I searched closed k/k PRs, and found one from April 2019, but that doesn't line up with the 1.13 release date (mentioned in the slack channel).

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

Scouring change logs and git history...

Kubernetes change logs suggest 1.11 and 1.12 had work done on anti-affinity scheduler performance - it looks like 1.11 would have been the bigger one.

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#sig-scheduling-1

  • Performance of the anti-affinity predicate of the default scheduler has been improved. (#66948, @mohamed-mehany)

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#sig-scheduling-1

  • Performance of the affinity/anti-affinity predicate for the default scheduler has been significantly improved. (#62211, @bsalamat)

That note was added back in Oct 2017.
https://github.com/kubernetes/website/blame/master/content/en/docs/concepts/configuration/assign-pod-node.md#L182
kubernetes/website#5979

I think that's before the original kube-dns anti-affinity change in Dec 2017:
kubernetes/kubernetes#57683

And the 1.11 change (#62211) dates to Apr 2018 - after anti-affinity had been pulled out.

I'll ask sig-scheduling if that warning is obsolete.

@chrisohaver
Member

Thanks @jmcmeek !

Not sure about the 50 weights ... or how to mix zone and node anti-affinity correctly.
Perhaps using an empty topology key would work just as well, and be shorter...

From https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity

For preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity, empty topologyKey is interpreted as “all topologies” (“all topologies” here is now limited to the combination of kubernetes.io/hostname, failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region).
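A sketch of what that shorter form might look like, assuming an empty topologyKey were accepted (note the PodAffinityTerm API reference quoted later in this thread says it is not allowed, so this is illustrative only):

```yaml
# Hypothetical shorter form: a single empty topologyKey, interpreted as
# "all topologies" (hostname + zone + region) for preferred pod
# anti-affinity, per the docs quoted above.
# NOTE: the v1 API reference states "Empty topologyKey is not allowed",
# so this may be rejected by validation.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100          # weight chosen arbitrarily for illustration
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values: ["kube-dns"]
        topologyKey: ""
```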

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

Checked with sig-scheduling about that warning: at https://kubernetes.slack.com/archives/C09TP78DV/p1568225168053800

Huang-Wei  16 minutes ago
that's a bit obsolete
Huang-Wei  14 minutes ago
but I want to ensure that although we did some improvement, the potential overhead still exists which is inevitable
Huang-Wei  13 minutes ago
in terms of cluster size, in some Github issue, some users mentioned they are using PodAffinity in 5k node cluster

@chrisohaver
Member

chrisohaver commented Sep 11, 2019

OK, thanks - we can add it and see if it passes the 5k-node k8s scale tests. We can add it here now, but as for merging into k8s (e.g. kubeadm/kube-up), 1.16 is already in code freeze, so it won't actually get scale tested at 5k nodes until later in the k8s 1.17 release.

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

@chrisohaver I did limited testing with the snippet I posted. It "worked for me".

failure-domain.beta.kubernetes.io/zone might not always be present - e.g. on a cluster built with kubeadm - so maybe it would be best not to use it.

Note: Pod anti-affinity requires nodes to be consistently labelled, i.e. every node in the cluster must have an appropriate label matching topologyKey. If some or all nodes are missing the specified topologyKey label, it can lead to unintended behavior.
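Dropping the zone term would leave just the node-level spread, which only depends on kubernetes.io/hostname - a label the kubelet sets on every node. A sketch of that reduced form (the weight of 100 here is an arbitrary choice for illustration):

```yaml
# Node-only preferred anti-affinity: spread kube-dns replicas across
# nodes without requiring a zone label to exist on the cluster.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values: ["kube-dns"]
        topologyKey: kubernetes.io/hostname
```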

I'll have to check out the empty topologyKey idea. It sounds promising.

@jmcmeek
Author

jmcmeek commented Sep 11, 2019

On the surface, empty topologyKey does not seem to be allowed.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.15/#podaffinityterm-v1-core

topologyKey string This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.

I found this in the Kubernetes 1.8 changelog, suggesting the note about using an empty topologyKey is outdated:

  • Affinity in annotations alpha feature is no longer supported in 1.8. Anyone upgrading from 1.7 with AffinityInAnnotation feature enabled must ensure pods (specifically with pod anti-affinity PreferredDuringSchedulingIgnoredDuringExecution) with empty TopologyKey fields must be removed before upgrading to 1.8. (#49976, @aveshagarwal)

@jmcmeek
Author

jmcmeek commented Nov 14, 2019

@chrisohaver Thank you!
