TopologySpreadConstraint: only evaluate nodes below ideal avg when balancing domains #836
Conversation
/cc @damemi
/lgtm
/assign
@@ -284,7 +291,7 @@ func balanceDomains(
 		// In other words, PTS can perform suboptimally if some of its chosen pods don't fit on other nodes.
 		// This is because the chosen pods aren't sorted, but immovable pods still count as "evicted" toward the PTS algorithm.
 		// So, a better selection heuristic could improve performance.
-		if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], nodes) {
+		if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) {
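For context, topologyNodesMap is the map being indexed in the new line above; a minimal sketch of the assumed grouping (simplified, not the exact descheduler code):

import v1 "k8s.io/api/core/v1"

// groupNodesByTopologyValue sketches the assumed shape of topologyNodesMap:
// nodes keyed by their value for the constraint's topology key, e.g.
// result["zoneA"] holds every node labeled zone=zoneA.
func groupNodesByTopologyValue(nodes []*v1.Node, topologyKey string) map[string][]*v1.Node {
	topologyNodesMap := make(map[string][]*v1.Node)
	for _, n := range nodes {
		if value, ok := n.Labels[topologyKey]; ok {
			topologyNodesMap[value] = append(topologyNodesMap[value], n)
		}
	}
	return topologyNodesMap
}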
This limits the check to nodes in the same topology domain, which can make the solution more suboptimal. Example:
Let's say we have the following distribution of pods among the domains:
- nodeA1{zone=A, pods capacity=1}: 1
- nodeA2{zone=A, pods capacity=4}: 4
- nodeB1{zone=B, pods capacity=2}: 2
- nodeB2{zone=B, pods capacity=2}: 1
After processing:
sortedDomains = [
topology{
pair={key=zone,value=B},
pods=[6,7,8],
},
topology{
pair={key=zone,value=A},
pods=[1,2,3,4,5],
},
]
topologyNodesMap = {
"A": [nodeA1, nodeA2],
"B": [nodeB1, nodeB2],
}
skew=2 (5-3)
To get skew=1 one can move a pod from nodeA2 to nodeB2, since nodeA1 does not have resources for 2 pods. With the change applied in this PR, node.PodFitsAnyOtherNode will be false, because with j=1:
topologyNodesMap[sortedDomains[1].pair.value] = topologyNodesMap["A"] = [nodeA1, nodeA2]
Did you consider this case?
CC @knelasevero
It actually looks at i (belowAvg) and not j, so with i=0:
topologyNodesMap[sortedDomains[0].pair.value] = topologyNodesMap["B"] = [nodeB1, nodeB2]
I pushed up a unit test covering your use case, and it is passing: 3f23278. Let me know if I captured it correctly.
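For anyone following the index discussion, here is the loop shape as I read it from this PR's diff (a simplified sketch; the sorting, averaging, and error handling are omitted):

// Simplified sketch of the two-pointer balancing loop after this PR.
// sortedDomains is ordered by pod count, so i walks domains below the
// ideal average and j walks domains above it.
i, j := 0, len(sortedDomains)-1
for i < j {
	// ... movePods computed from idealAvg, smallestDiff, and halfSkew (omitted) ...
	aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:]
	for k := range aboveToEvict {
		// The fit check is scoped to the i-th (below-average) domain,
		// i.e. the domain the pod would have to move into.
		if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) {
			continue
		}
		podsForEviction[aboveToEvict[k]] = struct{}{}
	}
	// ... update domain sizes and advance i and/or j ...
}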
Apologies, you got my counterexample right. I used the wrong example while getting lost in the indexes. Thanks for proving me wrong :) Let me re-iterate.
Before your change, the idea of using all nodes was that when node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], nodes) was false (meaning there's no node left for rescheduling), you'd skip the given pod and move on to the next pod from aboveToEvict. So after iterating through all the pods in aboveToEvict, you had either extended podsForEviction (the pod can be safely evicted) or skipped the pod (it is non-movable). It was therefore safe to shorten sortedDomains[j].pods, since the non-movable pods can't be evicted anyway.
With your change applied, node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) reduces the question from "is there any node suitable for rescheduling" to "is there any node in the i-th domain suitable for rescheduling". When the condition is false, you skip the podsForEviction[aboveToEvict[k]] = struct{}{} step. However, you still remove all the aboveToEvict pods from sortedDomains[j].pods. So in the next for i < j iteration, you no longer test whether the remaining "below avg" domains have free space to accept the aboveToEvict pods that were skipped in the previous iteration due to node.PodFitsAnyOtherNode being false.
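In code terms, the sequence I'm worried about looks roughly like this (a sketch using the names from the diff, not the exact code):

for k := range aboveToEvict {
	if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) {
		// The pod does not fit in the i-th domain, so it is skipped
		// and never added to podsForEviction...
		continue
	}
	podsForEviction[aboveToEvict[k]] = struct{}{}
}
// ...yet the domain sizes are updated for ALL of aboveToEvict, including
// the skipped pods, so later i/j iterations never reconsider them:
sortedDomains[j].pods = sortedDomains[j].pods[:len(sortedDomains[j].pods)-movePods]
sortedDomains[i].pods = append(sortedDomains[i].pods, aboveToEvict...)

Concretely, consider this distribution: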
sortedDomains = [
topology{
pair={key=zone,value=A},
pods=[1],
},
topology{
pair={key=zone,value=B},
pods=[2],
},
topology{
pair={key=zone,value=C},
pods=[3,4,5,6,7],
},
topology{
pair={key=zone,value=D},
pods=[8,9,10,11,12],
},
topology{
pair={key=zone,value=E},
pods=[13,14,15,16,17,18,19,20,21],
},
topology{
pair={key=zone,value=F},
pods=[22,23,24,25,26,27,28,29,30],
},
]
topologyNodesMap = {
"A": [nodeA],
"B": [nodeB],
"C": [nodeC],
"D": [nodeD],
"E": [nodeE],
"F": [nodeF],
}
skew=8 (9-1)
Assume we want to get skew=2, while pods 29 and 30 do not tolerate nodeA's taints but can land on nodeB. With i=0 and j=5: idealAvg=5, aboveAvg=4, belowAvg=4, smallestDiff=4, halfSkew=3, movePods=3.
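Those values follow from the balancing arithmetic; here is a sketch reconstructed to match the numbers above (maxSkew=2 is assumed from the constraint in the test case below):

// Balancing arithmetic for i=0 (domain A, 1 pod) and j=5 (domain F, 9 pods),
// with 30 pods across 6 domains and maxSkew=2.
// Formulas reconstructed to match the values quoted above.
idealAvg := 30.0 / 6.0                            // 5
aboveAvg := math.Ceil(9.0 - idealAvg)             // 4
belowAvg := math.Ceil(idealAvg - 1.0)             // 4
smallestDiff := math.Min(aboveAvg, belowAvg)      // 4
halfSkew := math.Ceil((9.0 - 1.0 - 2.0) / 2.0)    // ceil(3) = 3
movePods := int(math.Min(smallestDiff, halfSkew)) // 3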
aboveToEvict := sortedDomains[5].pods[len(sortedDomains[5].pods)-movePods:] = sortedDomains[5].pods[9-3:] = sortedDomains[5].pods[6:] = [28,29,30]
Assuming node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) is false for k = [0,1,2], podsForEviction stays empty while sortedDomains[j].pods gets truncated by [28,29,30] and sortedDomains[i].pods gets extended by [28,29,30]. So [28,29,30] get forgotten and will never be evicted.
If you repeat the same for j=4 and i=1, [19,20,21] will not get evicted even though the algorithm thinks they will be.
I'm giving a slightly different example in the test below, since two domains of the same size can be sorted in two ways ([a,b] or [b,a]); here every domain has a different size, while the same approach still applies. Pods [27,28,29] never get evicted, as the first domain (A) does not have enough CPU. The second domain (B) has enough CPU, yet the [27,28,29] pods still do not get evicted, since in the second iteration (i=1) they have already been moved under domain A without extending podsForEviction.
{
	name: "6 domains, sizes [[1], [2], [4], [6], [8], [9]], maxSkew=2, NodeFit is enabled; should move 5",
	nodes: []*v1.Node{
		// Node A only has 100m CPU, so pods from other domains cannot fit on it.
		test.BuildTestNode("A", 100, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneA" }),
		test.BuildTestNode("B", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneB" }),
		test.BuildTestNode("C", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneC" }),
		test.BuildTestNode("D", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneD" }),
		test.BuildTestNode("E", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneE" }),
		test.BuildTestNode("F", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneF" }),
	},
	pods: createTestPods([]testPodList{
		{
			count:       1,
			node:        "A",
			labels:      map[string]string{"foo": "bar"},
			constraints: getDefaultTopologyConstraints(2),
		},
		{
			count:       2,
			node:        "B",
			labels:      map[string]string{"foo": "bar"},
			constraints: getDefaultTopologyConstraints(2),
		},
		{
			count:       4,
			node:        "C",
			labels:      map[string]string{"foo": "bar"},
			constraints: getDefaultTopologyConstraints(2),
		},
		{
			count:       6,
			node:        "D",
			labels:      map[string]string{"foo": "bar"},
			constraints: getDefaultTopologyConstraints(2),
		},
		{
			count:       8,
			node:        "E",
			labels:      map[string]string{"foo": "bar"},
			constraints: getDefaultTopologyConstraints(2),
		},
		{
			count:       9,
			node:        "F",
			labels:      map[string]string{"foo": "bar"},
			constraints: getDefaultTopologyConstraints(2),
		},
	}),
	expectedEvictedCount: 5,
	strategy: api.DeschedulerStrategy{
		Params: &api.StrategyParameters{NodeFit: true},
	},
	namespaces: []string{"ns1"},
},
ping
It looks good to me so far. Very cool to now have specific pod evictions being checked. I want to go over it a bit more carefully and circle back here.
I still wonder about nodes with taints/affinity, which are not removed (where not tolerated) prior to our domain size calculation. But this is still an improvement over the current calculation, which causes eviction loops.
We have been running this in production and seeing a lot of improvements. Any chance this can be a patch candidate for 0.24?
Not to sound lazy, but due to the work involved with patch releases and the timing for the next release, non-critical fixes are generally not worth the effort. I'm not opposed to it if there is a lot of demand though.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: damemi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
thanks for your hard work @a7i
Closes #835 #839
TopologySpreadConstraint should only evaluate nodes below the ideal average when balancing domains, to prevent an eviction loop when nodes above the average are taken into account.
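For clarity, the net effect of the change, reading the two lines from the diff above (surrounding code elided):

// Before: fit was checked against every node in the cluster, so a pod could
// count as "movable" even if it only fit on nodes in other above-average
// domains, and the same pods would be reselected on every descheduling run.
if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], nodes) {

// After: fit is checked only against the nodes of the below-average
// destination domain, so a pod is only counted as movable when it can
// actually land where the balancing math assumes it will.
if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) {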