TopologySpreadConstraint: only evaluate nodes below ideal avg when balancing domains #836

Merged: 3 commits merged into kubernetes-sigs:master on Jul 4, 2022

Conversation

@a7i (Contributor) commented Jun 5, 2022:

Closes #835 #839

TopologySpreadConstraint should only evaluate nodes below the ideal average when balancing domains, to prevent an eviction loop that occurs when aboveAvg nodes are taken into account.
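
For illustration only, a minimal Go sketch of the idea (simplified stand-in types and hypothetical helper names such as podFits and canMoveToDomain, not the descheduler's actual code): the node-fit check for a pod being moved out of an above-average domain considers only the nodes of the below-average target domain, rather than every node in the cluster.

package main

import "fmt"

// Simplified stand-ins for illustration only; these are not the descheduler's
// real types or helpers.
type pod struct{ name string }

// podFits is a placeholder for the real node-fit check (resources, taints, ...).
func podFits(p pod, nodeName string) bool { return true }

// canMoveToDomain reports whether p fits on any node of the below-average
// target domain only, mirroring the PR's idea of passing
// topologyNodesMap[sortedDomains[i].pair.value] instead of all nodes.
func canMoveToDomain(p pod, targetDomainNodes []string) bool {
	for _, n := range targetDomainNodes {
		if podFits(p, n) {
			return true
		}
	}
	return false
}

func main() {
	nodesByDomain := map[string][]string{
		"zoneA": {"nodeA1", "nodeA2"}, // above-average domain
		"zoneB": {"nodeB1", "nodeB2"}, // below-average domain
	}
	p := pod{name: "pod-1"}
	// The fit check is restricted to the below-average domain's nodes.
	fmt.Println(canMoveToDomain(p, nodesByDomain["zoneB"]))
}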

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 5, 2022
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 5, 2022
@a7i (Contributor, Author) commented Jun 5, 2022:

/cc @damemi

@k8s-ci-robot k8s-ci-robot requested a review from damemi June 5, 2022 22:34
@JaneLiuL (Member) left a comment:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2022
@ingvagabund (Contributor) commented:
/assign

@@ -284,7 +291,7 @@ func balanceDomains(
 // In other words, PTS can perform suboptimally if some of its chosen pods don't fit on other nodes.
 // This is because the chosen pods aren't sorted, but immovable pods still count as "evicted" toward the PTS algorithm.
 // So, a better selection heuristic could improve performance.
-if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], nodes) {
+if !node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) {
@ingvagabund (Contributor) commented:
This limits the check to nodes with the same topology domain value, which makes the solution less suboptimal. Example:
Let's have the following distribution of pods among domains:

  • nodeA1{zone=A, pods capacity=1}: 1
  • nodeA2{zone=A, pods capacity=4}: 4
  • nodeB1{zone=B, pods capacity=2}: 2
  • nodeB2{zone=B, pods capacity=2}: 1

After getting processed:

sortedDomains = [
	topology{
		pair={key=zone,value=B},
		pods=[6,7,8],
	},
	topology{
		pair={key=zone,value=A},
		pods=[1,2,3,4,5],
	},
]
topologyNodesMap = {
	"A": [nodeA1, nodeA2],
	"B": [nodeB1, nodeB2],
}
skew=2 (5-3)

To get skew=1, one can move a pod from nodeA2 to nodeB2, since nodeA1 does not have resources for 2 pods. With the change applied in this PR, node.PodFitsAnyOtherNode will be false, because with j=1, topologyNodesMap[sortedDomains[1].pair.value] = topologyNodesMap["A"] = [nodeA1, nodeA2].

Did you consider this case?

@a7i (Contributor, Author) commented:

It actually looks at i (belowAvg) and not j, so:

i=0 topologyNodesMap[sortedDomains[0].pair.value] = topologyNodesMap["B"] = [nodeB1, nodeB2]

I pushed up a unit test covering your use case, which is passing:
3f23278

Let me know if I captured it correctly.

@ingvagabund (Contributor) commented Jun 9, 2022:

Apologies, you got my counterexample right. I used the wrong example while getting lost in the indexes. Thanks for proving me wrong :) I will re-iterate.

Before your change, the idea of using all nodes was that when node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], nodes) was false (meaning there was no node left for rescheduling), you'd skip the given pod and move on to another pod from aboveToEvict. So after iterating through all the pods in aboveToEvict, you had either extended podsForEviction (the pod can be safely evicted) or skipped the pod (it is non-movable). It was therefore safe to shorten sortedDomains[j].pods, since the non-movable pods can't be evicted and are simply skipped.

With your change applied, node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) reduces the question from "is there any node suitable for rescheduling" to "is there any node in the i-th domain suitable for rescheduling". So when the condition evaluates to false, you skip the podsForEviction[aboveToEvict[k]] = struct{}{} step. However, you still remove all the aboveToEvict pods from sortedDomains[j].pods. So when you run the next for i < j iteration, you no longer test whether the remaining "below avg" domains have free space to accept the aboveToEvict pods that were skipped in the previous iteration because the check was false.
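
To make the described control flow concrete, here is a condensed, self-contained Go sketch (simplified types; fitsInDomain is a placeholder standing in for node.PodFitsAnyOtherNode restricted to one domain's nodes; this is not the actual balanceDomains code). It reduces the scenario below to the two domains B and E of the i=1, j=4 iteration, with the other domains omitted (so the indexes become i=0, j=1): the skipped pods are never added to podsForEviction, yet the bookkeeping still moves them out of the above-average domain, so later iterations never reconsider them.

package main

import "fmt"

// Simplified placeholders; not the real balanceDomains types.
type pod string

type topology struct {
	value string // topology domain value (e.g. a zone name)
	pods  []pod
}

// fitsInDomain always returns false here to model the case where the
// aboveToEvict pods cannot land on any node of the i-th domain.
func fitsInDomain(p pod, domainNodes []string) bool { return false }

func main() {
	// i indexes a below-average domain, j an above-average one.
	sortedDomains := []topology{
		{value: "B", pods: []pod{"2"}},
		{value: "E", pods: []pod{"13", "14", "15", "16", "17", "18", "19", "20", "21"}},
	}
	topologyNodesMap := map[string][]string{"B": {"nodeB"}, "E": {"nodeE"}}
	podsForEviction := map[pod]struct{}{}

	i, j, movePods := 0, 1, 3
	aboveToEvict := sortedDomains[j].pods[len(sortedDomains[j].pods)-movePods:]

	for k := range aboveToEvict {
		if !fitsInDomain(aboveToEvict[k], topologyNodesMap[sortedDomains[i].value]) {
			continue // skipped: podsForEviction is NOT extended for this pod
		}
		podsForEviction[aboveToEvict[k]] = struct{}{}
	}

	// The bookkeeping still moves the skipped pods out of domain j and into
	// domain i, so later below-average domains never get a chance at them.
	sortedDomains[j].pods = sortedDomains[j].pods[:len(sortedDomains[j].pods)-movePods]
	sortedDomains[i].pods = append(sortedDomains[i].pods, aboveToEvict...)

	fmt.Println("pods marked for eviction:", len(podsForEviction))    // 0
	fmt.Println("pods left in domain E:", len(sortedDomains[j].pods)) // 6, as if 3 pods had moved
}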

sortedDomains = [
	topology{
		pair={key=zone,value=A},
		pods=[1],
	},
	topology{
		pair={key=zone,value=B},
		pods=[2],
	},
	topology{
		pair={key=zone,value=C},
		pods=[3,4,5,6,7],
	},
	topology{
		pair={key=zone,value=D},
		pods=[8,9,10,11,12],
	},
	topology{
		pair={key=zone,value=E},
		pods=[13,14,15,16,17,18,19,20,21],
	},
	topology{
		pair={key=zone,value=F},
		pods=[22,23,24,25,26,27,28,29,30],
	},
]
topologyNodesMap = {
	"A": [nodeA],
	"B": [nodeB],
	"C": [nodeC],
	"D": [nodeD],
	"E": [nodeE],
	"F": [nodeF],
}
skew=8 (9-1)

Assume we want to get skew=2, and pods 29 and 30 do not have the tolerations required for nodeA's taints but can land on nodeB. With i=0 and j=5: idealAvg=5, aboveAvg=4, belowAvg=4, smallestDiff=4, halfSkew=3, movePods=3.

aboveToEvict := sortedDomains[5].pods[len(sortedDomains[5].pods)-movePods:] = sortedDomains[5].pods[9-3:] = sortedDomains[5].pods[6:] = [28,29,30]

Assuming node.PodFitsAnyOtherNode(getPodsAssignedToNode, aboveToEvict[k], topologyNodesMap[sortedDomains[i].pair.value]) is false for k = [0,1,2], podsForEviction stays empty, while sortedDomains[j].pods gets truncated by [28,29,30] and sortedDomains[i].pods gets extended by [28,29,30]. However, [28,29,30] get forgotten and will not get evicted.

If you repeat the same for j=4 and i=1, [19,20,21] will not get evicted while the algorithm will think they were.

I'm giving a slightly different example, since two domains of the same size can be sorted in two ways ([a,b] or [b,a]); here every domain has a different size, but the same approach still applies. Pods [27,28,29] never get evicted, as the first domain (A) does not have enough CPU. The second domain (B) has enough CPU, yet the [27,28,29] pods still do not get evicted, because in the second iteration (i=1) they have already been moved under domain A without extending podsForEviction.

		{
			name: "5 domains, sizes [[1], [2], [4], [6], [8], [9]], maxSkew=2, NodeFit is enabled; should move 5",
			nodes: []*v1.Node{
				test.BuildTestNode("A", 100, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneA" }),
				test.BuildTestNode("B", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneB" }),
				test.BuildTestNode("C", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneC" }),
				test.BuildTestNode("D", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneD" }),
				test.BuildTestNode("E", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneE" }),
				test.BuildTestNode("F", 1000, 2000, 9, func(n *v1.Node) { n.Labels["zone"] = "zoneF" }),
			},
			pods: createTestPods([]testPodList{
				{
					count:       1,
					node:        "A",
					labels:      map[string]string{"foo": "bar"},
					constraints: getDefaultTopologyConstraints(2),
				},
				{
					count:       2,
					node:        "B",
					labels:      map[string]string{"foo": "bar"},
					constraints: getDefaultTopologyConstraints(2),
				},
				{
					count:       4,
					node:        "C",
					labels:      map[string]string{"foo": "bar"},
					constraints: getDefaultTopologyConstraints(2),
				},
				{
					count:       6,
					node:        "D",
					labels:      map[string]string{"foo": "bar"},
					constraints: getDefaultTopologyConstraints(2),
				},
				{
					count:       8,
					node:        "E",
					labels:      map[string]string{"foo": "bar"},
					constraints: getDefaultTopologyConstraints(2),
				},
				{
					count:       9,
					node:        "F",
					labels:      map[string]string{"foo": "bar"},
					constraints: getDefaultTopologyConstraints(2),
				},
			}),
			expectedEvictedCount: 5,
			strategy: api.DeschedulerStrategy{
				Params: &api.StrategyParameters{NodeFit: true},
			},
			namespaces: []string{"ns1"},
		},

@a7i (Contributor, Author) commented:

ping

A reviewer (Contributor) commented:

It looks good to me so far. Very cool to now have specific pod evictions being checked. I want to go over it a bit more carefully and circle back here.

@a7i (Contributor, Author) commented:

I still wonder about nodes with taints/affinity, which are not being removed (where not tolerated) prior to our domain size calculation.

But this is still an improvement over the current calculation, which causes eviction loops.

@a7i (Contributor, Author) commented:

We have been running this in production and seeing a lot of improvements. Any chance this can be a patch candidate for 0.24?

A reviewer (Contributor) commented:

Not to sound lazy, but due to the work involved with patch releases and the timing for the next release, non-critical fixes are generally not worth the effort. I'm not opposed to it if there is a lot of demand though.

@a7i (Contributor, Author) commented Jul 1, 2022:

Understood, that makes sense. This was really urgent for us, but we recently decided to run this fork for now, so it's not as urgent anymore.

The graph below shows eviction loops from the same node:
[graph: eviction-loop-same-node]

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 7, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2022
@a7i a7i force-pushed the balancedomains-belowavg branch from 3f23278 to e50cfeb Compare June 20, 2022 23:10
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2022
@a7i a7i force-pushed the balancedomains-belowavg branch from e50cfeb to 3b94e35 Compare June 20, 2022 23:12
@a7i a7i force-pushed the balancedomains-belowavg branch from 3b94e35 to 7a5e67d Compare June 20, 2022 23:22
@a7i a7i changed the title from "TopologySpreadConstraint: do not evaluate already balanced nodes" to "TopologySpreadConstraint: only evaluate nodes below ideal avg when balancing domains" Jun 20, 2022
@a7i a7i requested review from JaneLiuL and ingvagabund July 1, 2022 16:46
@damemi (Contributor) left a comment:

/approve

@k8s-ci-robot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2022
@JaneLiuL (Member) left a comment:

/lgtm
Thanks for your hard work, @a7i

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 4, 2022
@k8s-ci-robot k8s-ci-robot merged commit aff9a0b into kubernetes-sigs:master Jul 4, 2022
@a7i a7i deleted the balancedomains-belowavg branch July 4, 2022 16:34
@a7i a7i restored the balancedomains-belowavg branch July 5, 2022 20:59
@a7i a7i deleted the balancedomains-belowavg branch August 26, 2023 03:57
Labels: approved, cncf-cla: yes, lgtm, size/L

Successfully merging this pull request may close this issue: TopologySpreadConstraint strategy considers already balanced nodes when checking for nodeFit

6 participants