Add a parameter for making the thresholds of the LowNodeUtilization strategy relative to average values #473
Conversation
Hi @AmoVanB. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
// Logging specifies the options of logging.
// Refer [Logs Options](https://github.com/kubernetes/component-base/blob/master/logs/options.go) for more information.
Logging componentbaseconfig.LoggingConfiguration `json:"logging,omitempty"`
I was getting a complaint from update-generated-conversions.sh
without that. Not sure if it's really needed though.
} else {
	lowResourceThreshold = map[v1.ResourceName]*resource.Quantity{
		v1.ResourceCPU:    resource.NewMilliQuantity(int64(float64(strategyConfig.Thresholds[v1.ResourceCPU])*float64(nodeCapacity.Cpu().MilliValue())*0.01), resource.DecimalSI),
		v1.ResourceMemory: resource.NewQuantity(int64(float64(strategyConfig.Thresholds[v1.ResourceMemory])*float64(nodeCapacity.Memory().Value())*0.01), resource.BinarySI),
Is `resource.BinarySI` correct here? I took it from the existing code, but I'm surprised that it's different from the `resource.DecimalSI` used for CPU and pods. I'm not sure about the purpose/meaning of that parameter though.
Yes, they're different because for CPU, 5 cores = 5 * 1000m, but for memory, 5 GiB = 5 * 1024 MiB.
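For context, here is a stdlib-only sketch of the arithmetic in the snippet above (the helper name is hypothetical; the real code uses `resource.Quantity` from `k8s.io/apimachinery`). The percentage threshold is turned into an absolute per-node quantity; `resource.Format` (`DecimalSI` vs `BinarySI`) only controls how that quantity is rendered as a string (1000-based suffixes like `m`/`k` vs 1024-based `Ki`/`Mi`), not its numeric value.

```go
package main

import "fmt"

// absoluteThreshold is a hypothetical helper mirroring the snippet's math:
// thresholdPercent% of a node's capacity, truncated to an integer quantity.
func absoluteThreshold(thresholdPercent int, capacity int64) int64 {
	return int64(float64(thresholdPercent) * float64(capacity) * 0.01)
}

func main() {
	cpuMilli := absoluteThreshold(20, 4000)  // 20% of 4 cores (4000m) -> 800m
	memBytes := absoluteThreshold(20, 8<<30) // 20% of 8 GiB, in bytes
	fmt.Println(cpuMilli, memBytes)
}
```

The format argument would only matter later, when the resulting quantity is serialized back to a string.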
@@ -103,7 +104,7 @@ func TestLowNodeUtilization(t *testing.T) {
	},
	n2NodeName: {
		Items: []v1.Pod{
-			*test.BuildTestPod("p9", 400, 0, n1NodeName, test.SetRSOwnerRef),
+			*test.BuildTestPod("p9", 400, 0, n2NodeName, test.SetRSOwnerRef),
Here and in the 3 following changes, I thought this was a typo, as the pod is actually supposed to be on node 2. I'm not sure whether that is important or even correct.
Please put these changes into a separate PR.
Already fixed on master by this commit.
	continue
}

// 2. for each topologySpreadConstraint in that namespace
klog.V(1).InfoS("Processing "+strconv.Itoa(len(namespaceTopologySpreadConstraints))+" constraints", "namespace", namespace.Name)
Just added some logs which I think are very helpful for simple debugging.
I don't think these debug logs should be at this level; you can set it to level 3.
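To illustrate the reviewer's point about verbosity levels, here is a stdlib-only mimic of klog-style `V`-level gating (the names `V`, `InfoS`, and `verbosity` imitate klog, but this is not klog's implementation): a message logged at `V(3)` is only emitted when the process runs with verbosity of at least 3, so moving a log line from `V(1)` to `V(3)` hides it from default runs while keeping it available for debugging.

```go
package main

import "fmt"

// verbosity stands in for the value of klog's -v command-line flag.
var verbosity = 1

type verbose bool

// V reports whether messages at the given level should be emitted.
func V(level int) verbose { return verbose(verbosity >= level) }

// InfoS prints a structured message only when the level is enabled.
func (v verbose) InfoS(msg string, kv ...interface{}) {
	if v {
		fmt.Println(append([]interface{}{msg}, kv...)...)
	}
}

func main() {
	V(1).InfoS("Processing namespace", "namespace", "default") // emitted at default -v=1
	V(3).InfoS("Processing constraints", "count", 2)           // suppressed unless -v>=3
}
```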
946d17b to 326d175
Hi @AmoVanB. Thanks for your PR.

If we run `lowNodeUtilization` periodically in a dynamic cluster, we may need to change `Thresholds` or `TargetThresholds` once the total utilization of the cluster changes. So it makes sense to make `Thresholds` and `TargetThresholds` change with the average resource usage of the cluster.

/ok-to-test
/kind feature
|`thresholds`|map(string:int)|
|`targetThresholds`|map(string:int)|
I think we should consider changing the names of these thresholds to `low` and `high` to make them easier to understand; that's what @ingvagabund suggests here.
Yeah, we were discussing renaming the params a few times back, with respect to bumping the version of the strategy config as well, e.g. reorganizing the data type so that the same strategy can be configured multiple times with different configurations.
Though, we can still add new params, deprecate the current ones, and remove them after 3 releases.
Some suggestions before a proper review of this PR. I like the idea overall.
-	targetThresholds := strategy.Params.NodeResourceUtilizationThresholds.TargetThresholds
-	if err := validateStrategyConfig(thresholds, targetThresholds); err != nil {
+	strategyConfig := strategy.Params.NodeResourceUtilizationThresholds
+	if err := validateStrategyConfig(strategyConfig); err != nil {
Please put the changes generalizing `thresholds` and `targetThresholds` into `strategyConfig.thresholds` and `strategyConfig.targetThresholds` into a separate commit. It will be easier to follow the changes.
@@ -114,9 +115,10 @@ func RemovePodsViolatingTopologySpreadConstraint(
		(len(excludedNamespaces) > 0 && excludedNamespaces.Has(namespace.Name)) {
		continue
	}
+	klog.V(1).InfoS("Processing namespace", "namespace", namespace.Name)
Please put the changes to TSC into their own PR, as they're not related to the deviation feature.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: AmoVanB. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@AmoVanB: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We should get #434 merged first though.
I'm not working on this anymore. I tried to rebase, but there are a couple of non-obvious conflicts, and I unfortunately won't have time to work on rebasing and fixing them. Hopefully someone can take over! :)
Thanks @AmoVanB, I will take over from here ;) @ingvagabund, shall I open another PR with the required changes?
@matthieu-eck by all means. Thank you for continuing the work @AmoVanB started!!!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@AmoVanB: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…sigs#473 feat(LowNodeUtilization): useDeviationThresholds, redo of kubernetes-sigs#473
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
…sigs#473 [751]: normalize Percentage in nodeutilization and clean the tests
feat: Add DeviationThreshold Parameter for LowNodeUtilization (Previous attempt - #473)
Closing in favor of #751
@ingvagabund: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Reason for this PR

Currently, the thresholds that can be set for the `LowNodeUtilization` strategy only make sure that nodes aren't over- or underutilized relative to thresholds expressed as fractions of their capacities. However, it is sometimes desirable to make sure all nodes are "similarly" utilized to keep a cluster properly balanced. While this behavior can be implemented with the current thresholds, the values needed to achieve it would depend on the number of nodes and pods and on their memory/CPU requests, which is not very practical to configure, especially for dynamic clusters.

Feature of this PR
This PR enables the `descheduler` to achieve such a goal. We define an additional `useDeviationThresholds` boolean parameter (the name is maybe not the best; it can change). If `false`, the `descheduler` behaves as it does now. If `true`, the thresholds are interpreted as percentage deviations from the average utilization across all nodes.

For example, considering only the memory metric: if the nodes have utilizations of [10%, 20%, 30%, 20%] and both `threshold` and `targetThreshold` are 5%, the average node utilization is (10% + 20% + 30% + 20%) / 4 = 20%. The first node has a utilization of 10%, which is lower than 20% (average) - 5% (threshold); the node is hence considered underutilized. The third node has a utilization of 30%, which is greater than 20% (average) + 5% (targetThreshold); the node is hence considered overutilized. The other two nodes have utilizations within the [20% - 5%, 20% + 5%] window and are hence considered appropriately utilized.

Once nodes are labeled as under- or overutilized, the strategy behaves exactly as before.
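The classification rule above can be sketched as follows (a stdlib-only illustration with hypothetical names, not the PR's actual code):

```go
package main

import "fmt"

// classifyByDeviation applies the deviation rule described above: a node is
// underutilized when its utilization falls below average-threshold, and
// overutilized when it exceeds average+targetThreshold. It returns the
// indices of the nodes in each category.
func classifyByDeviation(utilizations []float64, threshold, targetThreshold float64) (under, over []int) {
	var sum float64
	for _, u := range utilizations {
		sum += u
	}
	avg := sum / float64(len(utilizations))
	for i, u := range utilizations {
		switch {
		case u < avg-threshold:
			under = append(under, i) // candidate destination for evicted pods
		case u > avg+targetThreshold:
			over = append(over, i) // candidate source of evictions
		}
	}
	return under, over
}

func main() {
	// The example from the description: average is 20%, window is [15%, 25%].
	under, over := classifyByDeviation([]float64{10, 20, 30, 20}, 5, 5)
	fmt.Println(under, over) // → [0] [2]: node 0 underutilized, node 2 overutilized
}
```

With `useDeviationThresholds: false`, the same thresholds would instead be compared against fixed fractions of each node's capacity, as the strategy does today.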
Code style comment

Since the only change compared to the original strategy is the way nodes are categorized as over- and underutilized, I thought it best to include it in the existing `LowNodeUtilization` strategy rather than create a new strategy. Another option would be to create two distinct strategies that share methods. I went for the first option as it was the easiest to implement, but I would definitely not be against the second.

Test
We had this need/problem in our 200+ pod cluster. We deployed the `descheduler` from this PR with success in our test clusters. It passes our tests (which make sure that the correct pods are evicted when the cluster is unbalanced) and has worked as expected, without any detected issues so far.

Vision
In general, the proposed feature goes in the direction of a more general intention we have for the `descheduler`: take the state of the cluster, smartly compute a new schedule for all the pods (not one by one but all at once, to leverage the complete knowledge we have) according to a given strategy (ours: balanced utilization), and then implement the transition from the current state to the desired state. This PR is our first (small) step towards that goal, but is there such a plan for the `descheduler`? Overall, that would turn the `descheduler` into a "cluster scheduler", in contrast to the current "per-pod scheduler" of Kubernetes. I know there is a plan to eventually incorporate scheduling into the tool rather than relying on the existing scheduler, but I'm not sure if you're looking into such optimization-driven strategies.