
Add PodTopologySpread strategy #413

Merged: 1 commit merged into kubernetes-sigs:master from the podtopologyspread branch on Nov 13, 2020

Conversation

@damemi (Contributor) commented Sep 30, 2020

(This picks up from #383)
Fixes #146

This PR attempts to add a new re-balancing strategy for PodTopologySpreading, specifically by evicting pods that belong to topology domains which violate a constraint's maxSkew for that topology.

Because pods can only match topology constraints within the same namespace, I thought it made sense to write this strategy as a loop over all the namespaces in the cluster (we can obviously add filtering for namespaces). This lets us count all the constraints in the namespace, one at a time, and compare the counts of all matching pods (via labelSelector) in that topology. I think this also simplifies the data structures necessary to map namespace -> topologyKey -> labelSelector -> topologyValue -> pods.

In its current state it's a very "brute force" approach so any feedback would be very helpful. I've noted spots where it could be optimized and I'm sure there are others, but this seems like the general algorithm we'll need. This is definitely trickier than I thought so I'm really hoping to get more eyes on this attempt now.
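
For reference, a minimal sketch of the bookkeeping this implies, with hypothetical names (the PR's own code uses a similar topologyPair map, visible in the review excerpts further down):

package sketch

import v1 "k8s.io/api/core/v1"

// topologyPair identifies one topology domain, e.g.
// {key: "topology.kubernetes.io/zone", value: "zone-a"}.
type topologyPair struct {
    key   string
    value string
}

// For each namespace, pods matching a constraint's labelSelector are grouped by
// the domain of the node they run on; the skew of a constraint is then the
// difference between its largest and smallest bucket.
type domainBuckets map[topologyPair][]*v1.Pod

// namespaceBuckets keeps one set of buckets per namespace, since constraints
// only apply to pods in their own namespace.
type namespaceBuckets map[string]domainBuckets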

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 30, 2020
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2020
@damemi (PR author) commented Sep 30, 2020

Note: as mentioned, this needs some heavy review, so it's still "wip-ish". I kept the original commits I picked for this but will squash them, and I'm also working on adding tests and updating the docs.

README.md (resolved review thread)
@@ -125,6 +125,7 @@ github.com/gogo/protobuf v1.1.1/go.mod h1:r8qH/GZQm5c6nD/R0oafs1akxWv10x8SbQlK7a
github.com/gogo/protobuf v1.2.1/go.mod h1:hp+jE20tsWTFYpLwKvXlhS1hjn+gTNwPg2I6zVXpSg4=
github.com/gogo/protobuf v1.3.1 h1:DqDEcV5aeaTmdFBePNpYsp3FlcVH/2ISVVM9Qf8PSls=
github.com/gogo/protobuf v1.3.1/go.mod h1:SlYgWuQ5SjCEi6WLHjHCa1yvBfUnHcTbrrZtXPKa29o=
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b h1:VKtxabqXZkF25pY9ekfRL6a582T4P37/31XEstQ5p58=
A Member commented:

It seems odd that this is being updated. Is this change needed?

@damemi (PR author) replied:

I think it's leftover from one of the old commits I picked. Will double check this when I eventually squash but I imagine it will disappear then

@seanmalloy (Member) commented:

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 1, 2020
@seanmalloy (Member) commented:

/test pull-descheduler-verify-master

@damemi (PR author) commented Oct 1, 2020

/hold
Thanks for the review comments so far; I'll update those. But I was thinking about the main part of the algorithm, the part that determines how many pods to evict from a topology zone, and I think it could be improved over what I currently have.

Right now, when zoneA skews zoneB by N pods, I'm just evicting N pods from zoneA. Ideally, all these pods would go into the scheduling queue and get scheduled evenly by their topology requirements. But it might make more sense to evict something close to N/2 pods from zoneA, which would ideally balance the zones without requiring so many evictions. You then need to consider how this affects future comparisons in this run (i.e., how does zoneA then compare to zoneC). I was looking into this and it seems like maybe a weird partitioning problem, so I'm interested in what you guys think about this.

I'm pretty content with the algorithm for building the zone buckets, going namespace by namespace. I think this is just O(N) in the number of pods in the cluster, but in the worst case it could be O(N^2) if every pod has a constraint.
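
A toy illustration of that trade-off, with hypothetical numbers that are not taken from the PR:

package main

import "fmt"

func main() {
    zoneA, zoneB := 6, 2 // hypothetical pod counts for one constraint
    skew := zoneA - zoneB

    // Current approach: evict the full skew from the larger zone, then rely on
    // the scheduler to re-spread all of the evicted pods.
    fmt.Println("evict full skew:", skew) // 4 evictions

    // Proposed refinement: evict roughly half the skew. If the scheduler places
    // the evicted pods into the emptier zone, both zones end up at 4 pods, with
    // half the evictions.
    fmt.Println("evict half the skew:", skew/2) // 2 evictions
}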

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020
@Huang-Wei commented:

@damemi if needed, I should be able to take a review pass in the next couple of days.

@seanmalloy (Member) commented:

> /hold
> Thanks for the review comments so far; I'll update those. But I was thinking about the main part of the algorithm, the part that determines how many pods to evict from a topology zone, and I think it could be improved over what I currently have.
>
> Right now, when zoneA skews zoneB by N pods, I'm just evicting N pods from zoneA. Ideally, all these pods would go into the scheduling queue and get scheduled evenly by their topology requirements. But it might make more sense to evict something close to N/2 pods from zoneA, which would ideally balance the zones without requiring so many evictions. You then need to consider how this affects future comparisons in this run (i.e., how does zoneA then compare to zoneC). I was looking into this and it seems like maybe a weird partitioning problem, so I'm interested in what you guys think about this.
>
> I'm pretty content with the algorithm for building the zone buckets, going namespace by namespace. I think this is just O(N) in the number of pods in the cluster, but in the worst case it could be O(N^2) if every pod has a constraint.

I did look through the code and based on my knowledge what you have looks correct to me. I think it would be great if @ingvagabund and @Huang-Wei could review when they have some time.

/cc @Huang-Wei @ingvagabund

sumPods++
}

// To actually determine how many pods need to be moved, we sort the topologies in ascending length
@damemi (PR author) commented Oct 2, 2020:

I added what I think is a better algorithm for determining how many pods to evict from topologies here (in df09c2a)

I put a lot of description in the comments, but basically what I'm doing is splitting the topologies into two lists, balanced around the "ideal average" (i.e., the size each domain would have if all pods were perfectly spread). Mathematically, it seems like all topologies need to fall within maxSkew of this average. So I split them into those that are "borrowing" from the average to sit above it, and those that are "lending" to it by sitting below it.

So I compare the smallest topology to the largest topology (these are the most likely to be skewed, and the largest topology will have the most "borrowed" that it is able to pay back). If the largest topology has enough to satisfy the maxSkew of the smallest topology, we move on to check the same with the next largest "small" topology. If it does not, we move on to check with the next smallest "large" topology.

I've tried this with a couple manual examples but would like some help verifying that my assumptions are right. I think this would run in linear time (minus the sorting) and simplify the data structures needed to keep track of how many pods have been moved.
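
For readers following along, here is a simplified, hypothetical sketch of that pairing idea (not the PR's actual code): sort the domains by size, then repeatedly let the largest domain hand one pod back to the smallest, as long as that pair is skewed by more than maxSkew and the largest domain stays within a whole pod of the ideal average. The sketch optimistically assumes each evicted pod gets rescheduled into the smallest domain.

package main

import (
    "fmt"
    "math"
    "sort"
)

// balanceDomains returns how many evictions this sketch would issue for the
// given domain sizes.
func balanceDomains(sizes []int, maxSkew int) int {
    sorted := append([]int(nil), sizes...)
    sort.Ints(sorted)

    total := 0
    for _, s := range sorted {
        total += s
    }
    idealAvg := float64(total) / float64(len(sorted))

    evictions := 0
    i, j := 0, len(sorted)-1
    for i < j {
        switch {
        case sorted[j]-sorted[i] <= maxSkew:
            // The smallest remaining domain is satisfied; check the next one.
            i++
        case math.Ceil(float64(sorted[j])-idealAvg) < 1:
            // The largest remaining domain has nothing left to "pay back"
            // without falling a whole pod below the average; try the next largest.
            j--
        default:
            // Evict one pod from the largest domain and assume it lands in the
            // smallest one.
            sorted[j]--
            sorted[i]++
            evictions++
        }
    }
    return evictions
}

func main() {
    // With domains [1 3] and maxSkew 1, one eviction balances them to [2 2].
    fmt.Println(balanceDomains([]int{1, 3}, 1))
}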

@damemi force-pushed the podtopologyspread branch 3 times, most recently from 3961c55 to df09c2a on October 2, 2020 at 16:50
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 6, 2020
@damemi force-pushed the podtopologyspread branch 2 times, most recently from 5523bc5 to 617e4eb on October 7, 2020 at 16:43
pkg/descheduler/strategies/topologyspreadconstraint.go (resolved review thread)
constraintTopologies := make(map[topologyPair][]*v1.Pod)
// pre-populate the topologyPair map with all the topologies available from the nodeMap
// (we can't just build it from existing pods' nodes because a topology may have 0 pods)
for _, node := range nodeMap {

A reviewer commented:

This for loop is expensive. I'd suggest putting it in an outer loop right after we have calculated all constraints within the current namespace.

And IMO we should even move it to the outermost loop, after we have all constraints over all eligible namespaces; i.e., make constraintTopologies a map keyed by namespace, with map[topologyPair][]*v1.Pod values, which would be the most efficient way.

BTW: you can move the logic of building nodeMap below calculating the constraints, since if len(constraints) == 0 we can return earlier.

@damemi (PR author) replied:

> This for loop is expensive. I'd suggest putting it in an outer loop right after we have calculated all constraints within the current namespace.

The problem is that this loop is meant to identify any topologyPairs/domains that have 0 pods in them. The only way for us to know that a label on a pair is a topologyKey (and not just a label) is to check it against each constraint's key. So even if it's moved outside the loop, it will still end up being O(M*N), since we have to check each node against each constraint.

> And IMO we should even move it to the outermost loop, after we have all constraints over all eligible namespaces; i.e., make constraintTopologies a map keyed by namespace, with map[topologyPair][]*v1.Pod values, which would be the most efficient way.

I'm not sure what efficiency is gained here, as we will still be looping over each namespace -> constraint.

> BTW: you can move the logic of building nodeMap below calculating the constraints, since if len(constraints) == 0 we can return earlier.

This loop is ranging over constraints, so if len(constraints) == 0 we don't have to worry about that.

pkg/descheduler/strategies/topologyspreadconstraint.go (resolved review thread)
// 2. for each topologySpreadConstraint in that namespace
for _, constrainedPod := range namespaceTopologySpreadConstrainPods {
// find pods to evict one constraint at a time
for _, constraint := range constrainedPod.Spec.TopologySpreadConstraints {

A reviewer commented:

Suppose we have a deployment with N pods, and it has hard constraints. In this case, L131~L133 would deal with the same constraint N times, right?

@Huang-Wei commented Oct 8, 2020:

So I think an ideal approach is to iterate over the pods and get a merged slice with the tuple {[]constraints, nodeSelector/nodeAffinity} as its element. Later we process each tuple instead of each constraint.

(maybe tuple {constraint, nodeSelector/nodeAffinity} works better)

@damemi (PR author) replied:

Good point, but I think we could also just cover this with a map[v1.TopologySpreadConstraint]struct{} (see new commit)
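
For illustration, a hedged sketch of that dedup idea. The sketch uses a small derived key rather than the constraint struct itself (newer versions of the API type add slice fields that can no longer be used as a map key), and all names here are hypothetical:

package sketch

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// constraintKey is a comparable stand-in for a constraint, good enough to
// dedup identical constraints repeated across all pods of a deployment.
type constraintKey struct {
    maxSkew           int32
    topologyKey       string
    whenUnsatisfiable v1.UnsatisfiableConstraintAction
    selector          string
}

// uniqueConstraints collects each distinct constraint in a namespace once,
// no matter how many pods carry it.
func uniqueConstraints(pods []*v1.Pod) []v1.TopologySpreadConstraint {
    seen := map[constraintKey]struct{}{}
    var out []v1.TopologySpreadConstraint
    for _, p := range pods {
        for _, c := range p.Spec.TopologySpreadConstraints {
            k := constraintKey{
                maxSkew:           c.MaxSkew,
                topologyKey:       c.TopologyKey,
                whenUnsatisfiable: c.WhenUnsatisfiable,
                selector:          metav1.FormatLabelSelector(c.LabelSelector),
            }
            if _, ok := seen[k]; ok {
                continue
            }
            seen[k] = struct{}{}
            out = append(out, c)
        }
    }
    return out
}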

@lixiang233 (Contributor) left a comment:

I notice that the PodTopologySpread plugin supports cluster-level default constraints; should we support this in this strategy too?

pkg/descheduler/strategies/topologyspreadconstraint.go (several resolved review threads)
Comment on lines 151 to 153
if !evictable.IsEvictable(&namespacePods.Items[i]) {
continue
@lixiang233 (Contributor) commented:

Unevictable pods are ignored while their constraints are not, so the pod count in some topology domains may be lower than it actually is. Suppose we have two domains, A (1 pod) and B (3 pods), with maxSkew of 1: 1 pod in domain B should be evicted, but if all of B's pods except one are unevictable, no pod will be evicted because maxSkew appears not to be exceeded.
I think we can treat unevictable pods like pods with nodeAffinity and only ignore them when we try to evict them.

@damemi (PR author) replied:

I don't think the descheduler should ever consider unevictable pods as evictable, even in this case. It is worth noting in the documentation that unevictable pods may hurt the effectiveness of this strategy, but that's about the best we can do.

IMO they are unevictable for a reason, and we provide various knobs for the user to adjust those criteria.

@damemi (PR author) followed up:

@lixiang233 revisiting this, after running the strategy in more live-cluster scenarios I apologize and take back what I said. I agree with you that we shouldn't be excluding unevictable pods at this point, but rather when we try to evict them. That's much more intuitive and you were absolutely correct!

@damemi (PR author) added:

However, we should sort non-evictable pods so they are considered last for eviction, so I think it could be helpful to make the evictable type exported so that I can pass it to my sort functions here. @ingvagabund wdyt?

@damemi force-pushed the podtopologyspread branch 2 times, most recently from 6dab021 to a0c34a5 on October 21, 2020 at 19:55
@damemi (PR author) left a comment:

At this point I'm pretty satisfied with the use case coverage in this strategy. Through feedback we have identified some edge cases and limitations of the implementation, but for a basic "balance the topologies" approach I think this works well.

I will try to add some more test cases. I also made some updates to the balancing algorithm to improve its effectiveness and readability. Please take another look and let me know if anything else needs updating (or if there are edge cases you think I should add to the unit tests).

@damemi force-pushed the podtopologyspread branch 3 times, most recently from 12f911e to 6489493 on October 23, 2020 at 15:17
Comment on lines 279 to 322
sort.Slice(list, func(i, j int) bool {
// any non-evictable pods should be considered last (ie, first in the list)
if !isEvictable(list[i]) || !isEvictable(list[j]) {
// false - i is the only non-evictable, so return true to put it first
// true - j is non-evictable, so return false to put j before i
// if both are non-evictable, order doesn't matter
return !isEvictable(list[j])
}

// if both pods have selectors/affinity, compare them by their priority
if hasSelectorOrAffinity(*list[i]) && hasSelectorOrAffinity(*list[j]) {
return comparePodsByPriority(list[i], list[j])
} else if hasSelectorOrAffinity(*list[i]) && !hasSelectorOrAffinity(*list[j]) {
// if just one has selectors/affinity
// i should come first
return true
} else if !hasSelectorOrAffinity(*list[i]) && hasSelectorOrAffinity(*list[j]) {
// if just one has selectors/affinity
// j should come first
return false
} else {
// if neither has selectors/affinity, fall back to just priority
return comparePodsByPriority(list[i], list[j])
}
})
@damemi (PR author) commented:

@lixiang233 or @ingvagabund mind double checking this sorting code to make sure it's right? The goal is:

  1. Put non-evictable pods first
  2. Put pods with affinity or selector next
  3. Order remaining pods by priority

// false - i is the only non-evictable, so return true to put it first
// true - j is non-evictable, so return false to put j before i
// if both are non-evictable, order doesn't matter
return !isEvictable(list[j])
A Contributor replied:

Returning !isEvictable(list[j]) will make an evictable pod come before a non-evictable pod: if j is evictable, then i is non-evictable, !isEvictable(list[j]) is false, and then j comes before i. return isEvictable(list[j]) should be correct.

Another Contributor replied:

If i is not evictable (short circuited):

  • If j is not evictable it does not matter if you shuffle it (return *)
  • If j is evictable, don't shuffle it (return true a.k.a isEvictable(list[j]))

If i is evictable:

  • If j is not evictable, shuffle it (return true a.k.a !isEvictable(list[j]))
  • If j is evictable, it does not matter if you shuffle it (return *)

If both i and j are not evictable, it does not matter if you shuffle it (return *)

return !(isEvictable(list[i]) && !isEvictable(list[j])) will do. In other words, only when i is evictable and j is not, shuffle both. Otherwise do nothing (=return false).

iIsEvictable := isEvictable(list[i])
jIsEvictable := isEvictable(list[j])
if !iIsEvictable || !jIsEvictable {
	return !(iIsEvictable && !jIsEvictable)
}

Comment on lines 256 to 259
if aboveToEvict[i].Spec.NodeSelector != nil ||
(aboveToEvict[i].Spec.Affinity != nil &&
aboveToEvict[i].Spec.Affinity.NodeAffinity != nil &&
aboveToEvict[i].Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil) {
A Contributor commented:

I think we should check if pod is evictable here, given we added non-evictable pods to the list.

A Contributor replied:

> it will just end up back on the same node

Node selector can target a set of nodes, and the same holds for node affinity. What about checking whether the node selector or affinity can target at most one node, and only then continuing?

A Contributor added:

> I think we should check if pod is evictable here, given we added non-evictable pods to the list.

Pods from podsForEviction are not evicted until all constraints are processed. You might be running the check multiple times instead of once if the pod participates in multiple constraints.

@damemi (PR author) replied:

> Node selector can target a set of nodes, and the same holds for node affinity. What about checking whether the node selector or affinity can target at most one node, and only then continuing?

That's a good point; I have actually been wondering whether it really makes sense to hard-exclude any pods with affinity/selectors. My thinking was different, though: the affinity conditions could change on the cluster, making the pod perfectly evictable (see our own descheduling strategies). We could actually re-use some helpers from those strategies to inform this decision, but that's toeing the line of conflating two different strategies. I think it would be fine though, wdyt?

@damemi (PR author) followed up:

@ingvagabund I added a helper here to count the number of nodes this pod could fit on (besides the one it's currently on). If that value is 0, it either only fits on this node or doesn't fit on any node; either way it's not PTS's job to evict it in that case.

I did add a TODO here, because I'm realizing that since we don't order the pods with affinity by how many other nodes they can fit on, there could be enough valid evictable pods with affinity that we never get to (this list is created with just aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:], which is efficient but optimistic).

The alternative would be to either add additional sorting for affinity pods based on other feasible nodes, or loop through all the pods in sortedDomains[j], maintain a counter, and keep trying until count == movePods or k == 0.
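
To make that idea concrete, here is a hypothetical, simplified version of such a helper; it only evaluates nodeSelector, while the real helper would also need to check required node affinity terms:

package sketch

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
)

// otherFeasibleNodes counts nodes, other than the pod's current node, whose
// labels satisfy the pod's nodeSelector. If it returns 0, evicting the pod
// cannot help rebalance the topology, so the strategy should skip it.
func otherFeasibleNodes(pod *v1.Pod, nodes []*v1.Node) int {
    selector := labels.SelectorFromSet(labels.Set(pod.Spec.NodeSelector))
    count := 0
    for _, node := range nodes {
        if node.Name == pod.Spec.NodeName {
            continue // skip the node the pod is already running on
        }
        if selector.Matches(labels.Set(node.Labels)) {
            count++
        }
    }
    return count
}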

pods []*v1.Pod
}

func validateTopologySpreadParams(ctx context.Context, client clientset.Interface, params *api.StrategyParameters) (int32, sets.String, sets.String, error) {
A Contributor commented:

validateTopologySpreadParams is doing more than validating

var sumPods float64
for i := range namespacePods.Items {
// 4. if the pod matches this TopologySpreadConstraint LabelSelector
s, err := metav1.LabelSelectorAsSelector(constraint.LabelSelector)
A Contributor commented:

s is invariant in this for loop, so there's no need to parse it for every pod in namespacePods.Items.

Also, please avoid one-letter variable names; sel or selector is easier to read and search for.
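
A sketch of the suggested restructuring, with illustrative names rather than the PR's actual function: parse the constraint's selector once, then reuse it for every pod in the namespace.

package sketch

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/klog/v2"
)

// podsMatchingConstraint parses the constraint's labelSelector exactly once
// and then matches every pod in the namespace against it.
func podsMatchingConstraint(constraint v1.TopologySpreadConstraint, namespacePods []v1.Pod) []*v1.Pod {
    selector, err := metav1.LabelSelectorAsSelector(constraint.LabelSelector)
    if err != nil {
        klog.ErrorS(err, "Couldn't parse label selector as selector", "selector", constraint.LabelSelector)
        return nil
    }

    var matched []*v1.Pod
    for i := range namespacePods {
        if selector.Matches(labels.Set(namespacePods[i].Labels)) {
            matched = append(matched, &namespacePods[i])
        }
    }
    return matched
}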

node, ok := nodeMap[namespacePods.Items[i].Spec.NodeName]
if !ok {
klog.ErrorS(err, "Unknown nodeName for pod", "node", namespacePods.Items[i].Spec.NodeName, "pod", klog.KObj(&namespacePods.Items[i]))
}
A Contributor commented:

If ok is false, node is nil and node.Labels will panic. In that case the pod is yet to be scheduled, so it's safe to just continue here.
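
A minimal sketch of that guard (hypothetical helper name), assuming an unknown node simply means the pod has not been scheduled yet:

package sketch

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/klog/v2"
)

// nodeForPod looks up the pod's node and reports whether it is known. A pod
// whose node is missing from nodeMap is not scheduled yet, so it neither
// contributes to skew nor is worth evicting; callers can just continue.
func nodeForPod(pod *v1.Pod, nodeMap map[string]*v1.Node) (*v1.Node, bool) {
    node, ok := nodeMap[pod.Spec.NodeName]
    if !ok {
        klog.V(2).InfoS("Skipping pod not assigned to a known node", "pod", klog.KObj(pod))
        return nil, false
    }
    return node, true
}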

// 4. if the pod matches this TopologySpreadConstraint LabelSelector
s, err := metav1.LabelSelectorAsSelector(constraint.LabelSelector)
if err != nil {
klog.ErrorS(err, "Couldn't parse label selector as selector", "selector", constraint.LabelSelector)
A Contributor commented:

If err is not nil, s is nil and s.Matches will panic. In that case the constraint is invalid, which is something to be checked on the apiserver side. Invalid constraints are refused at least by the scheduler, so continue is safe.

@damemi (PR author) replied:

I would still log the error, because label selectors aren't always guaranteed to be validated by the apiserver (something we've run into in the scheduler, where we still find areas in which they aren't parsed correctly).


for pod := range podsForEviction {
node, ok := nodeMap[pod.Spec.NodeName]
if !ok {
A Contributor commented:

This case should not happen, as pods which are not scheduled contribute nothing to the topology spreading. Thus, such pods should not even be evicted.

@damemi (PR author) replied:

This goes with what you mentioned above where I was checking nodeMap, right? (#413 (comment)) So any pod that gets to this point must have a node in the node map.

belowAvg := math.Ceil(idealAvg - float64(len(sortedDomains[i].pods)))
smallestDiff := math.Min(aboveAvg, belowAvg)
halfSkew := math.Floor(skew / 2)
movePods := int(math.Min(smallestDiff, halfSkew))
A Contributor commented:

You are comparing smallestDiff (which is a float rounded up to an integer) and halfSkew, which can be non-integer. Shouldn't you round up halfSkew first and then take the minimum?

@damemi (PR author) replied:

halfSkew is rounded down


// remove pods from the higher topology and add them to the list of pods to be evicted
// also (just for tracking), add them to the list of pods in the lower topology
aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:]
A Contributor commented:

In case idealAvg is 4.5 and a domain has 5 pods, aboveAvg is 1 and may be selected as the smallest of {aboveAvg, belowAvg, halfSkew}; by removing that pod from sortedDomains[j].pods, the domain size will fall under idealAvg, which again contradicts the comment about how many pods it has remaining without falling below the average.

E.g. [3 5 5 5], idealAvg = 18/4 = 4.5, skew(i=0, j=3) = 2, halfSkew = 1, aboveAvg = 1, belowAvg = 2.

@damemi (PR author) replied Oct 27, 2020:

True; in that case of [3 5 5 5] we should choose 1, to make the new sizes [4 5 5 4].

I think the comment really just needs clarification that no domain will fall more than a whole pod under the ideal average (or maybe just "under the ideal average, rounded down"). Those fractions of a pod can mathematically add up and be allowed to move.
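
A quick arithmetic check of that [3 5 5 5] example, using the same rounding as the excerpt above (aboveAvg is computed analogously to belowAvg; this is illustrative, not the PR's code):

package main

import (
    "fmt"
    "math"
)

func main() {
    // Hypothetical domain sizes [3 5 5 5]: compare the smallest (3) with the
    // largest (5). idealAvg = 18/4 = 4.5.
    idealAvg := 4.5
    small, large := 3.0, 5.0

    skew := large - small                   // 2
    halfSkew := math.Floor(skew / 2)        // 1
    aboveAvg := math.Ceil(large - idealAvg) // ceil(0.5) = 1
    belowAvg := math.Ceil(idealAvg - small) // ceil(1.5) = 2
    smallestDiff := math.Min(aboveAvg, belowAvg)

    movePods := int(math.Min(smallestDiff, halfSkew))
    fmt.Println(movePods) // 1 -> evict one pod, giving [4 5 5 4] if it lands in the small domain
}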

// remove pods from the higher topology and add them to the list of pods to be evicted
// also (just for tracking), add them to the list of pods in the lower topology
aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:]
for i := range aboveToEvict {
A Contributor commented:

Please use an index variable other than i, even though it's scoped only to this loop; e.g. k.


@damemi force-pushed the podtopologyspread branch 2 times, most recently from b90412c to b1b7012 on October 27, 2020 at 21:19
This adds a strategy to balance pod topology domains based on the scheduler's
PodTopologySpread constraints. It attempts to find the minimum number of pods
that should be sent for eviction by comparing the largest domains in a topology
with the smallest domains in that topology.
@damemi (PR author) left a comment:

/hold cancel
It's been a while since this had any feedback, and I brought it up at the last SIG meeting with no complaints :) Ready for lgtm if there are no more comments.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 13, 2020
@seanmalloy (Member) commented Nov 13, 2020

Let's do this. In my opinion it will be good to get end user feedback to improve this feature in the future. The only way to get end user feedback is to ship the feature.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4839d5f into kubernetes-sigs:master Nov 13, 2020
briend pushed a commit to briend/descheduler that referenced this pull request Feb 11, 2022
Linked issue closed by this PR: Policy for Balancing Pods across topology domains (#146)