Add PodTopologySpread strategy #413
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: damemi. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

Note: As mentioned this needs some heavy review, and so it's still "wip-ish". I kept the original commits I picked for this but will squash them, as well as working on adding tests and updating docs.
```
@@ -125,6 +125,7 @@ github.com/gogo/protobuf v1.1.1/go.mod h1:r8qH/GZQm5c6nD/R0oafs1akxWv10x8SbQlK7a
github.com/gogo/protobuf v1.2.1/go.mod h1:hp+jE20tsWTFYpLwKvXlhS1hjn+gTNwPg2I6zVXpSg4=
github.com/gogo/protobuf v1.3.1 h1:DqDEcV5aeaTmdFBePNpYsp3FlcVH/2ISVVM9Qf8PSls=
github.com/gogo/protobuf v1.3.1/go.mod h1:SlYgWuQ5SjCEi6WLHjHCa1yvBfUnHcTbrrZtXPKa29o=
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b h1:VKtxabqXZkF25pY9ekfRL6a582T4P37/31XEstQ5p58=
```
It seems odd that this is being updated. Is this change needed?
I think it's leftover from one of the old commits I picked. Will double check this when I eventually squash but I imagine it will disappear then
/kind feature

/test pull-descheduler-verify-master

/hold
Right now I'm pretty content with the algorithm for building the zone buckets, going namespace by namespace. I think this is just O(N) for pods in the cluster, but in the worst case it could be O(N^2) if every pod has a constraint.

@damemi if needed, I should be able to take a review pass in the next couple of days.

I did look through the code and based on my knowledge what you have looks correct to me. I think it would be great if @ingvagabund and @Huang-Wei could review when they have some time.
```go
	sumPods++
}

// To actually determine how many pods need to be moved, we sort the topologies in ascending length
```
I added what I think is a better algorithm for determining how many pods to evict from topologies here (in df09c2a).

I put a lot of description in the comments, but basically what I'm doing is splitting the topologies into 2 lists, balanced around the "ideal average" (ie, the size if all pods were perfectly spread). Mathematically, it seems like all topologies need to fall within `maxSkew` of this average. So I built around those that are "borrowing" from the average to be above it, and those that are "lending" to the average to be below it.

So I compare the smallest topology to the largest topology (these are the most likely to be skewed, and the largest topology will have the most "borrowed" that it is able to pay back). If the largest topology has enough to satisfy the `maxSkew` of the smallest topology, we move on to check the same with the next largest "small" topology. If it does not, we move on to check with the next smallest "large" topology.

I've tried this with a couple of manual examples but would like some help verifying that my assumptions are right. I think this would run in linear time (minus the sorting) and simplify the data structures needed to keep track of how many pods have been moved.
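The pairing idea can be sketched roughly like this. This is a simplified, hypothetical model (plain domain sizes instead of pod lists, one pod moved per step), not the PR's actual implementation, but it converges to the same balanced state:

```go
package main

import (
	"fmt"
	"sort"
)

// balanceDomains repeatedly moves one pod from the largest domain to the
// smallest until the skew between them is within maxSkew, and returns how
// many pods were moved. The PR computes the move count in bulk instead,
// but the fixed point is the same.
func balanceDomains(sizes []int, maxSkew int) int {
	sort.Ints(sizes)
	moved := 0
	for sizes[len(sizes)-1]-sizes[0] > maxSkew {
		sizes[len(sizes)-1]-- // evict from the largest ("borrowing") domain
		sizes[0]++            // assume the scheduler places it in the smallest
		moved++
		sort.Ints(sizes)
	}
	return moved
}

func main() {
	fmt.Println(balanceDomains([]int{1, 5}, 1)) // 2 pods must move: [1 5] -> [3 3]
}
```

With `[3 5 5 5]` and `maxSkew=1` this moves a single pod, matching the worked example that comes up later in the review.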
(force-pushed from 3961c55 to df09c2a)
(force-pushed from 5523bc5 to 617e4eb)
```go
constraintTopologies := make(map[topologyPair][]*v1.Pod)
// pre-populate the topologyPair map with all the topologies available from the nodeMap
// (we can't just build it from existing pods' nodes because a topology may have 0 pods)
for _, node := range nodeMap {
```
This `for` loop is expensive. I'd suggest putting it in an outer loop right after we have calculated all constraints within the current namespace.

And, IMO we should even move it to the outermost loop, after we have all constraints over all eligible namespaces - i.e., constraintTopologies is a map keyed by namespace, with values of `map[topologyPair][]*v1.Pod` - which would be the most efficient way.

BTW: you can move the logic of building nodeMap below calculating constraints, since if `len(constraints) == 0` we can return earlier.
> This `for` loop is expensive. I'd suggest putting it in an outer loop right after we have calculated all constraints within the current namespace.

The problem is this loop is meant to identify any topologyPairs/domains that have 0 pods in them. The only way for us to know that a label on a pair is a topologyKey (and not just a label) is to check against each constraint's key. So even if it's moved outside the loop, it will still end up being O(M*N), where we have to check each node for each constraint.

> And, IMO we should even move it to the outermost loop, after we have all constraints over all eligible namespaces - i.e., constraintTopologies is a map keyed by namespace, with values of `map[topologyPair][]*v1.Pod` - which would be the most efficient way.

I'm not sure what efficiency is gained here, as we will still be looping over each namespace->constraint.

> BTW: you can move the logic of building nodeMap below calculating constraints, since if `len(constraints) == 0` we can return earlier.

This loop is ranging over `constraints`, so if `len(constraints) == 0` we don't have to worry about that.
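To make the zero-pod-domain point concrete, here is a minimal, hypothetical sketch (plain label maps standing in for `*v1.Node`, a `seedDomains` helper that is not the PR's code) of seeding every domain from the node list so that empty domains still count toward skew:

```go
package main

import "fmt"

type topologyPair struct{ key, value string }

// seedDomains builds the domain map from nodes rather than from pods, so a
// topology domain with zero pods still appears (it can never be discovered
// by walking existing pods' nodes).
func seedDomains(topologyKey string, nodes []map[string]string) map[topologyPair][]string {
	domains := map[topologyPair][]string{}
	for _, labels := range nodes {
		if v, ok := labels[topologyKey]; ok {
			pair := topologyPair{topologyKey, v}
			if _, seen := domains[pair]; !seen {
				domains[pair] = []string{} // empty domain counts toward skew
			}
		}
	}
	return domains
}

func main() {
	nodes := []map[string]string{
		{"zone": "a"}, {"zone": "a"}, {"zone": "b"}, {"zone": "c"},
	}
	fmt.Println(len(seedDomains("zone", nodes))) // 3 domains, even before any pods are counted
}
```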
```go
// 2. for each topologySpreadConstraint in that namespace
for _, constrainedPod := range namespaceTopologySpreadConstrainPods {
	// find pods to evict one constraint at a time
	for _, constraint := range constrainedPod.Spec.TopologySpreadConstraints {
```
Suppose we have a deployment having N pods, and it has hard constraints. In this case, L131~L133 would deal with the same constraint N times, right?
So I think an ideal approach is to iterate over pods and get a merged slice with tuple `{[]constraints, nodeSelector/nodeAffinity}` as its element, and later process each tuple instead of each constraint. (Maybe the tuple `{constraint, nodeSelector/nodeAffinity}` works better.)
Good point, but I think we could also just cover this with a `map[v1.TopologySpreadConstraint]struct{}` (see new commit).
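The dedup idea can be sketched as below. Note a caveat: the real `v1.TopologySpreadConstraint` holds a `*metav1.LabelSelector`, so map keys compare the pointer rather than the selector's contents; this sketch uses a flattened, comparable stand-in type, and all names here are illustrative:

```go
package main

import "fmt"

// Simplified stand-in for v1.TopologySpreadConstraint, flattened so it is
// comparable and usable as a map key.
type topologySpreadConstraint struct {
	MaxSkew     int32
	TopologyKey string
	Selector    string // the real type holds a *metav1.LabelSelector
}

type pod struct {
	Name        string
	Constraints []topologySpreadConstraint
}

// dedupConstraints collects each distinct constraint once, so N pods of the
// same deployment don't cause the same constraint to be processed N times
// (the concern raised above).
func dedupConstraints(pods []pod) []topologySpreadConstraint {
	seen := map[topologySpreadConstraint]struct{}{}
	var out []topologySpreadConstraint
	for _, p := range pods {
		for _, c := range p.Constraints {
			if _, ok := seen[c]; ok {
				continue
			}
			seen[c] = struct{}{}
			out = append(out, c)
		}
	}
	return out
}

func main() {
	c := topologySpreadConstraint{MaxSkew: 1, TopologyKey: "zone", Selector: "app=web"}
	pods := []pod{{"web-1", []topologySpreadConstraint{c}}, {"web-2", []topologySpreadConstraint{c}}}
	fmt.Println(len(dedupConstraints(pods))) // 1: the shared constraint is processed once
}
```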
(force-pushed from c3f3b7f to 63fd28b)
I notice that the PodTopologySpread plugin supports cluster-level default constraints; should we support this in this strategy too?
```go
if !evictable.IsEvictable(&namespacePods.Items[i]) {
	continue
```
Unevictable pods are ignored while their constraints are not, so the pod count in some topology domains may be less than it actually is. Suppose we have two domains A (1 pod) and B (3 pods) and maxSkew is 1: 1 pod in domain B should be evicted, but if all pods on B except one are unevictable, no pod will be evicted because maxSkew appears not to be violated.

I think we can treat unevictable pods like pods with nodeAffinity: only ignore them when we try to evict them.
I don't think the descheduler should ever consider unevictable pods as evictable, even in this case. Though it is worth noting in documentation that unevictable pods may hurt the effectiveness of this strategy, that's about the best we can do. IMO they are unevictable for a reason, and we provide various knobs for the user to adjust that criteria.
@lixiang233 revisiting this, after running the strategy in more live-cluster scenarios I apologize and take back what I said. I agree with you that we shouldn't be excluding unevictable pods at this point, but rather when we try to evict them. That's much more intuitive and you were absolutely correct!
However we should sort non-evictable pods to be considered last for eviction, so I think it could be helpful to make the `evictable` type exported so that I can pass it to my sort functions here. @ingvagabund wdyt?
(force-pushed from 6dab021 to a0c34a5)
At this point I'm pretty satisfied with the use case coverage in this strategy. Through feedback we have identified some edge cases and limitations of the implementation, but for a basic "balance the topologies" approach I think this works well.
I will try to add some more test cases. I also made some updates to the balancing algorithm to improve its effectiveness and readability. Please review some more and let me know if there are any updates (or edge cases you think I should add to the unit tests).
(force-pushed from 12f911e to 6489493)
```go
sort.Slice(list, func(i, j int) bool {
	// any non-evictable pods should be considered last (ie, first in the list)
	if !isEvictable(list[i]) || !isEvictable(list[j]) {
		// false - i is the only non-evictable, so return true to put it first
		// true - j is non-evictable, so return false to put j before i
		// if true and both are non-evictable, order doesn't matter
		return !isEvictable(list[j])
	}

	// if both pods have selectors/affinity, compare them by their priority
	if hasSelectorOrAffinity(*list[i]) && hasSelectorOrAffinity(*list[j]) {
		return comparePodsByPriority(list[i], list[j])
	} else if hasSelectorOrAffinity(*list[i]) && !hasSelectorOrAffinity(*list[j]) {
		// if just one has selectors/affinity, i should come first
		return true
	} else if !hasSelectorOrAffinity(*list[i]) && hasSelectorOrAffinity(*list[j]) {
		// if just one has selectors/affinity, j should come first
		return false
	} else {
		// if neither has selectors/affinity, fall back to just priority
		return comparePodsByPriority(list[i], list[j])
	}
})
```
@lixiang233 or @ingvagabund mind double checking this sorting code to make sure it's right? The goal is:
- Put non-evictable pods first
- Put pods with affinity or selector next
- Order remaining pods by priority
(force-pushed from 6489493 to dc9ed08)
```go
// false - i is the only non-evictable, so return true to put it first
// true - j is non-evictable, so return false to put j before i
// if true and both are non-evictable, order doesn't matter
return !isEvictable(list[j])
```
`return !isEvictable(list[j])` will make an evictable pod come before a non-evictable pod: if j is evictable, then i is non-evictable, `!isEvictable(list[j])` is false, so j is before i. `return isEvictable(list[j])` should be correct.
If `i` is not evictable (short-circuited):
- If `j` is not evictable, it does not matter if you shuffle it (`return *`)
- If `j` is evictable, don't shuffle it (return true, a.k.a. `isEvictable(list[j])`)

If `i` is evictable:
- If `j` is not evictable, shuffle it (return true, a.k.a. `!isEvictable(list[j])`)
- If `j` is evictable, it does not matter if you shuffle it (`return *`)

If both `i` and `j` are not evictable, it does not matter if you shuffle it (`return *`)

`return !(isEvictable(list[i]) && !isEvictable(list[j]))` will do. In other words, only when `i` is evictable and `j` is not, shuffle both. Otherwise do nothing (= `return false`).

```go
iIsEvictable := isEvictable(list[i])
jIsEvictable := isEvictable(list[j])
if !iIsEvictable || !jIsEvictable {
	return !(iIsEvictable && !jIsEvictable)
}
```
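To sanity-check the intended ordering, here is a small runnable sketch (string names standing in for pods, a stub `isEvictable`, and a hypothetical `sortForEviction` helper, not the PR's code). It phrases the goal as a strict ordering, returning true only when `i` is non-evictable and `j` is evictable, which also sidesteps the ambiguous both-non-evictable case:

```go
package main

import (
	"fmt"
	"sort"
)

// sortForEviction orders pods so that non-evictable pods come first
// (i.e., are considered last for eviction). SliceStable keeps the
// relative order of pods with equal evictability.
func sortForEviction(list []string, isEvictable func(string) bool) []string {
	sort.SliceStable(list, func(i, j int) bool {
		return !isEvictable(list[i]) && isEvictable(list[j])
	})
	return list
}

func main() {
	evictable := map[string]bool{"a": true, "b": false, "c": true, "d": false}
	isEvictable := func(p string) bool { return evictable[p] }
	fmt.Println(sortForEviction([]string{"a", "b", "c", "d"}, isEvictable)) // [b d a c]
}
```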
```go
if aboveToEvict[i].Spec.NodeSelector != nil ||
	(aboveToEvict[i].Spec.Affinity != nil &&
		aboveToEvict[i].Spec.Affinity.NodeAffinity != nil &&
		aboveToEvict[i].Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil) {
```
I think we should check if pod is evictable here, given we added non-evictable pods to the list.
> it will just end up back on the same node

Node selector can target a set of nodes. The same holds for node affinity. What about checking whether the node selector or affinity can target at most one node, and only then continue?
> I think we should check if pod is evictable here, given we added non-evictable pods to the list.

Pods from `podsForEviction` are not evicted until all constraints are processed. You might be running the check multiple times compared to once if the pod participates in multiple constraints.
> Node selector can target a set of nodes. The same holds for node affinity. What about checking whether the node selector or affinity can target at most one node, and only then continue?

That's a good point; I have actually been wondering whether it really makes sense to hard-exclude any pods with affinity/selector. My thinking was different though: the affinity conditions could change on the cluster, making the pod perfectly evictable (see our own descheduling strategies). We could actually re-use some helpers from those strategies in informing this decision, but that's toeing the line of conflating 2 different strategies. I think it would be fine though, wdyt?
@ingvagabund I added a helper here to count the # of nodes this pod could fit on (besides the one it's currently on). If that value is 0, it either only fits on this node or doesn't fit on any node; either way it's not PTS's job to evict it in that case.

I did add a TODO here, because I'm realizing that since we don't order the pods with affinity based on how many other nodes they can fit on, there could be enough valid evictable pods with affinity that we don't get to (since this list is created with just `aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:]`, which is efficient but optimistic).

The alternative would be to either add additional sorting for affinity pods based on other feasible nodes, or loop through all the pods in `sortedDomains[j]` and maintain a counter, and keep trying until `count == movePods` or `k == 0`.
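The shape of such a helper might look like the sketch below. This is a hypothetical, simplified version: nodeSelector matching is reduced to a label-subset check, whereas the real strategy would need to consider node affinity, taints, and so on:

```go
package main

import "fmt"

type node struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every selector key/value is present in labels,
// a toy stand-in for real nodeSelector evaluation.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// feasibleNodeCount counts the nodes, other than the pod's current node,
// that the selector matches. A result of 0 means evicting the pod can't
// improve spreading (it only fits where it already is, or nowhere).
func feasibleNodeCount(selector map[string]string, current string, nodes []node) int {
	count := 0
	for _, n := range nodes {
		if n.Name == current {
			continue
		}
		if matches(selector, n.Labels) {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		{"n1", map[string]string{"disk": "ssd"}},
		{"n2", map[string]string{"disk": "ssd"}},
		{"n3", map[string]string{"disk": "hdd"}},
	}
	// pod currently on n1 with nodeSelector disk=ssd: only n2 also fits
	fmt.Println(feasibleNodeCount(map[string]string{"disk": "ssd"}, "n1", nodes))
}
```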
```go
	pods []*v1.Pod
}

func validateTopologySpreadParams(ctx context.Context, client clientset.Interface, params *api.StrategyParameters) (int32, sets.String, sets.String, error) {
```
`validateTopologySpreadParams` is doing more than validating.
```go
var sumPods float64
for i := range namespacePods.Items {
	// 4. if the pod matches this TopologySpreadConstraint LabelSelector
	s, err := metav1.LabelSelectorAsSelector(constraint.LabelSelector)
```
`s` is invariant in this for loop; no need to parse it for every pod in `namespacePods.Items`.

Also, please avoid using one-letter variable names. `sel` or `selector` is easier to read and search for.
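The hoisting being suggested can be illustrated with a self-contained sketch (a toy `parseSelector` standing in for `metav1.LabelSelectorAsSelector`; all names here are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelector turns a "key=value" string into a match function. The parse
// happens once; only the cheap closure runs per pod.
func parseSelector(raw string) func(labels map[string]string) bool {
	parts := strings.SplitN(raw, "=", 2)
	return func(labels map[string]string) bool { return labels[parts[0]] == parts[1] }
}

// countMatching parses the selector once, outside the pod loop, rather than
// re-parsing it on every iteration.
func countMatching(raw string, pods []map[string]string) int {
	selector := parseSelector(raw) // invariant: hoisted out of the loop
	n := 0
	for _, podLabels := range pods {
		if selector(podLabels) {
			n++
		}
	}
	return n
}

func main() {
	pods := []map[string]string{{"app": "web"}, {"app": "db"}, {"app": "web"}}
	fmt.Println(countMatching("app=web", pods))
}
```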
```go
node, ok := nodeMap[namespacePods.Items[i].Spec.NodeName]
if !ok {
	klog.ErrorS(err, "Unknown nodeName for pod", "node", namespacePods.Items[i].Spec.NodeName, "pod", klog.KObj(&namespacePods.Items[i]))
}
```
If `ok` is false, `node` is nil, in which case `node.Labels` will panic. In that case the pod is yet to be scheduled, so it's safe to just `continue` here.
```go
// 4. if the pod matches this TopologySpreadConstraint LabelSelector
s, err := metav1.LabelSelectorAsSelector(constraint.LabelSelector)
if err != nil {
	klog.ErrorS(err, "Couldn't parse label selector as selector", "selector", constraint.LabelSelector)
```
If `err` is not nil, `s` is nil and `s.Matches` will panic. In that case the constraint is invalid; that's something to be checked on the apiserver side. Invalid constraints are refused at least by the scheduler, so `continue` is safe.
I would still log the error, because label selectors aren't always guaranteed to be validated by the apiserver (which we've run into in the scheduler, where we still find areas they aren't parsed correctly).
```go
for pod := range podsForEviction {
	node, ok := nodeMap[pod.Spec.NodeName]
	if !ok {
```
This case should not happen, as pods which are not scheduled have no contribution to the topology spreading. Thus, such pods should not even be evicted.
This goes with what you mentioned above where I was checking `nodeMap`, right? (#413 (comment)). So any pods that get to this point must have a node in the node map.
```go
belowAvg := math.Ceil(idealAvg - float64(len(sortedDomains[i].pods)))
smallestDiff := math.Min(aboveAvg, belowAvg)
halfSkew := math.Floor(skew / 2)
movePods := int(math.Min(smallestDiff, halfSkew))
```
You are comparing `smallestDiff` (which is a float rounded up to an integer) and `halfSkew`, which can be non-integer. Shouldn't you round up `halfSkew` first and then take the minimum?
`halfSkew` is rounded down.
```go
// remove pods from the higher topology and add them to the list of pods to be evicted
// also (just for tracking), add them to the list of pods in the lower topology
aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:]
```
In case `idealAvg` is 4.5, `aboveAvg` is 5 and selected as the smallest of `{aboveAvg, belowAvg, halfSkew}`, by removing the pod from `sortedDomains[j].pods` the domain size will fall under `idealAvg`. Which contradicts again with "how many it has remaining without falling below the average".

E.g. [3 5 5 5]: `idealAvg` = 18/4 = 4.5, `skew(i=0, j=3)` = 2, `halfSkew` = 1, `aboveAvg` = 1, `belowAvg` = 2.
True, in that case of [3 5 5 5] we should choose 1 to make the new sizes [4 5 5 4].

I think the comment really just needs clarification that no domain will fall more than a whole pod under the ideal average (or maybe just "under the ideal average, rounded down"). Those fractions of a pod can mathematically add up and be allowed to move.
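The arithmetic in question, reproduced for the [3 5 5 5] case in an illustrative helper (not the PR's code; the pair compared is the smallest and largest domain):

```go
package main

import (
	"fmt"
	"math"
)

// movePods is the minimum of: how far the large domain sits above the ideal
// average, how far the small domain sits below it, and half the skew
// (rounded down), mirroring the snippet under discussion.
func movePods(small, large int, idealAvg float64) int {
	skew := float64(large - small)
	aboveAvg := math.Ceil(float64(large) - idealAvg)
	belowAvg := math.Ceil(idealAvg - float64(small))
	smallestDiff := math.Min(aboveAvg, belowAvg)
	halfSkew := math.Floor(skew / 2)
	return int(math.Min(smallestDiff, halfSkew))
}

func main() {
	// domains [3 5 5 5]: idealAvg = 18/4 = 4.5, skew(3,5) = 2
	// aboveAvg = ceil(5-4.5) = 1, belowAvg = ceil(4.5-3) = 2, halfSkew = 1
	fmt.Println(movePods(3, 5, 4.5)) // 1 pod moves, giving [4 5 5 4]
}
```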
```go
// remove pods from the higher topology and add them to the list of pods to be evicted
// also (just for tracking), add them to the list of pods in the lower topology
aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:]
for i := range aboveToEvict {
```
Please use an index variable other than `i`, even though it's scoped only to this loop. E.g. `k`.
(force-pushed from b90412c to b1b7012)
This adds a strategy to balance pod topology domains based on the scheduler's PodTopologySpread constraints. It attempts to find the minimum number of pods that should be sent for eviction by comparing the largest domains in a topology with the smallest domains in that topology.
(force-pushed from b1b7012 to 4108362)
/hold cancel
It's been a while since this had any feedback, and I brought it up at the last sig meeting with no complaints :) ready for lgtm if there are no more comments.

Let's do this. In my opinion it will be good to get end-user feedback to improve this feature in the future. The only way to get end-user feedback is to ship the feature.
/lgtm
Add PodTopologySpread strategy
(This picks up from #383)
Fixes #146
This PR attempts to add a new re-balancing strategy for PodTopologySpreading, specifically by evicting pods which belong to topologies that violate any constraint's `MaxSkew` for that topology.

Because pods can only match topology constraints within the same namespace, I thought it made sense to write this strategy as a loop over all the namespaces in the cluster (we can obviously add filtering for namespaces). This lets us count all the constraints in the namespace, one at a time, and compare the counts of all matching pods (via labelSelector) in that topology. I think this also simplifies the data structures necessary to map `namespace -> topologyKey -> labelSelector -> topologyValue -> pods`.

In its current state it's a very "brute force" approach, so any feedback would be very helpful. I've noted spots where it could be optimized and I'm sure there are others, but this seems like the general algorithm we'll need. This is definitely trickier than I thought, so I'm really hoping to get more eyes on this attempt now.
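The nested bookkeeping the description mentions might look roughly like this (illustrative stand-in types; the real code uses `topologyPair` keys with `[]*v1.Pod` values):

```go
package main

import "fmt"

// topologyPair pairs a topologyKey with one of its values, identifying a
// single topology domain.
type topologyPair struct {
	key   string
	value string
}

func main() {
	// namespace -> topology domain -> pod names in that domain
	domains := map[string]map[topologyPair][]string{
		"default": {
			{"topology.kubernetes.io/zone", "zone-a"}: {"web-1", "web-2"},
			{"topology.kubernetes.io/zone", "zone-b"}: {"web-3"},
		},
	}
	pair := topologyPair{"topology.kubernetes.io/zone", "zone-a"}
	fmt.Println(len(domains["default"][pair])) // 2 pods in zone-a
}
```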