Add dynamic percentage of node scoring to user docs #12235
Merged
Changes from all commits · 2 commits
@@ -8,31 +8,39 @@ weight: 70

 {{% capture overview %}}

-{{< feature-state for_k8s_version="1.12" >}}
+{{< feature-state for_k8s_version="1.14" state="beta" >}}

 Kube-scheduler is the Kubernetes default scheduler. It is responsible for
 placement of Pods on Nodes in a cluster. Nodes in a cluster that meet the
 scheduling requirements of a Pod are called "feasible" Nodes for the Pod. The
 scheduler finds feasible Nodes for a Pod and then runs a set of functions to
 score the feasible Nodes and picks a Node with the highest score among the
-feasible ones to run the Pod. The scheduler then notifies the API server about this
-decision in a process called "Binding".
+feasible ones to run the Pod. The scheduler then notifies the API server about
+this decision in a process called "Binding".

 {{% /capture %}}

 {{% capture body %}}

 ## Percentage of Nodes to Score

-Before Kubernetes 1.12, Kube-scheduler used to check the feasibility of all the
-nodes in a cluster and then scored the feasible ones. Kubernetes 1.12 has a new
-feature that allows the scheduler to stop looking for more feasible nodes once
-it finds a certain number of them. This improves the scheduler's performance in
-large clusters. The number is specified as a percentage of the cluster size and
-is controlled by a configuration option called `percentageOfNodesToScore`. The
-range should be between 1 and 100. Other values are considered as 100%. The
-default value of this option is 50%. A cluster administrator can change this value by providing a
-different value in the scheduler configuration. However, it may not be necessary to change this value.
+Before Kubernetes 1.12, Kube-scheduler used to check the feasibility of all
+nodes in a cluster and then scored the feasible ones. Kubernetes 1.12 added a
+new feature that allows the scheduler to stop looking for more feasible nodes
+once it finds a certain number of them. This improves the scheduler's
+performance in large clusters. The number is specified as a percentage of the
+cluster size. The percentage can be controlled by a configuration option called
+`percentageOfNodesToScore`. The range should be between 1 and 100. Larger values
+are considered as 100%. Zero is equivalent to not providing the config option.
+Kubernetes 1.14 has logic to find the percentage of nodes to score based on the
+size of the cluster if it is not specified in the configuration. It uses a
+linear formula which yields 50% for a 100-node cluster. The formula yields 10%
+for a 5000-node cluster. The lower bound for the automatic value is 5%. In other
+words, the scheduler always scores at least 5% of the cluster no matter how
+large the cluster is, unless the user provides the config option with a value
+smaller than 5.

 Below is an example configuration that sets `percentageOfNodesToScore` to 50%.

 ```yaml
 apiVersion: componentconfig/v1alpha1
@@ -45,41 +53,37 @@ algorithmSource:
 percentageOfNodesToScore: 50
 ```

-{{< note >}}
-In clusters with zero or less than 50 feasible nodes, the
-scheduler still checks all the nodes, simply because there are not enough
-feasible nodes to stop the scheduler's search early.
-{{< /note >}}
+{{< note >}} In clusters with less than 50 feasible nodes, the scheduler still
+checks all the nodes, simply because there are not enough feasible nodes to stop
+the scheduler's search early. {{< /note >}}

 **To disable this feature**, you can set `percentageOfNodesToScore` to 100.

 ### Tuning percentageOfNodesToScore

-`percentageOfNodesToScore` must be a value between 1 and 100
-with the default value of 50. There is also a hardcoded minimum value of 50
-nodes which is applied internally. The scheduler tries to find at
-least 50 nodes regardless of the value of `percentageOfNodesToScore`. This means
-that changing this option to lower values in clusters with several hundred nodes
-will not have much impact on the number of feasible nodes that the scheduler
-tries to find. This is intentional as this option is unlikely to improve
-performance noticeably in smaller clusters. In large clusters with over a 1000
-nodes setting this value to lower numbers may show a noticeable performance
-improvement.
+`percentageOfNodesToScore` must be a value between 1 and 100 with the default
+value being calculated based on the cluster size. There is also a hardcoded
+minimum value of 50 nodes. This means that changing
+this option to lower values in clusters with several hundred nodes will not have
+much impact on the number of feasible nodes that the scheduler tries to find.
+This is intentional as this option is unlikely to improve performance noticeably
+in smaller clusters. In large clusters with over a 1000 nodes setting this value
+to lower numbers may show a noticeable performance improvement.

 An important note to consider when setting this value is that when a smaller
 number of nodes in a cluster are checked for feasibility, some nodes are not
 sent to be scored for a given Pod. As a result, a Node which could possibly
 score a higher value for running the given Pod might not even be passed to the
 scoring phase. This would result in a less than ideal placement of the Pod. For
 this reason, the value should not be set to very low percentages. A general rule
-of thumb is to never set the value to anything lower than 30. Lower values
+of thumb is to never set the value to anything lower than 10. Lower values
 should be used only when the scheduler's throughput is critical for your
 application and the score of nodes is not important. In other words, you prefer
 to run the Pod on any Node as long as it is feasible.

-It is not recommended to lower this value from its default if your cluster has
-only several hundred Nodes. It is unlikely to improve the scheduler's
-performance significantly.
+If your cluster has several hundred Nodes or fewer, we do not recommend lowering
+the default value of this configuration option. It is unlikely to improve the
+scheduler's performance significantly.

> **Review comment** (on the paragraph above): good rephrasing.

 ### How the scheduler iterates over Nodes

@@ -91,8 +95,8 @@ for running Pods, the scheduler iterates over the nodes in a round robin
 fashion. You can imagine that Nodes are in an array. The scheduler starts from
 the start of the array and checks feasibility of the nodes until it finds enough
 Nodes as specified by `percentageOfNodesToScore`. For the next Pod, the
-scheduler continues from the point in the Node array that it stopped at when checking
-feasibility of Nodes for the previous Pod.
+scheduler continues from the point in the Node array that it stopped at when
+checking feasibility of Nodes for the previous Pod.

 If Nodes are in multiple zones, the scheduler iterates over Nodes in various
 zones to ensure that Nodes from different zones are considered in the
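The round-robin walk described in this last hunk is easy to misread in prose, so here is a toy Go sketch of the resumable iteration. This is an illustration under stated assumptions, not kube-scheduler's actual code: `nodeIterator` and `findFeasible` are hypothetical names, the feasibility check is a stand-in predicate, and the multi-zone interleaving mentioned in the hunk is omitted for brevity.

```go
package main

import "fmt"

// nodeIterator walks a fixed node list in round-robin fashion, resuming
// each Pod's search where the previous one stopped. Toy illustration;
// not kube-scheduler's actual implementation.
type nodeIterator struct {
	nodes []string
	next  int // index where the next search starts
}

// findFeasible checks nodes starting at the saved index and stops once
// `want` feasible nodes are found or the whole list has been examined.
func (it *nodeIterator) findFeasible(feasible func(string) bool, want int) []string {
	var found []string
	for checked := 0; checked < len(it.nodes) && len(found) < want; checked++ {
		node := it.nodes[it.next]
		it.next = (it.next + 1) % len(it.nodes) // wrap around the array
		if feasible(node) {
			found = append(found, node)
		}
	}
	return found
}

func main() {
	it := &nodeIterator{nodes: []string{"n1", "n2", "n3", "n4", "n5", "n6"}}
	allFeasible := func(string) bool { return true }

	// The first Pod's search stops after 3 nodes; the second resumes at n4.
	fmt.Println(it.findFeasible(allFeasible, 3)) // [n1 n2 n3]
	fmt.Println(it.findFeasible(allFeasible, 3)) // [n4 n5 n6]
}
```

The point of keeping `next` across calls is fairness: successive Pods start their feasibility search at different offsets, so the same front-of-array Nodes are not examined first every time.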
> **Review comment:** So I can tell that you have completely captured the behavior with this description, but most readers probably won't want to solve linear equations to parse this. How about we give either a table with a few extra data points, or add a chart?

> **Reply:** I believe most users will be happy with the default behavior and may not care about the actual value for their clusters. Dynamic changes of the cluster size, for example due to autoscaling, make this value variable over time. So, I am not sure if providing a table for various cluster sizes is very valuable. Most users should not ever need to think about this option.
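To make the linear formula concrete without asking readers to solve equations, here is a minimal Go sketch of how the automatic percentage could be derived. The constants are assumptions chosen to reproduce the data points in the doc text (50% for a 100-node cluster, 10% for a 5000-node cluster, a 5% floor, and the hardcoded 50-node minimum); they are not copied from the kube-scheduler source.

```go
package main

import "fmt"

// Assumed constants, chosen to match the documented data points;
// the real kube-scheduler values may differ.
const (
	defaultPercentage = 50 // starting point of the linear formula
	minPercentage     = 5  // lower bound for the automatic value
	minNodesToFind    = 50 // hardcoded minimum number of feasible nodes
)

// numFeasibleNodesToFind returns how many feasible nodes the scheduler
// stops after, given the cluster size and the configured percentage
// (0 means "not set", so the linear formula applies).
func numFeasibleNodesToFind(numAllNodes, configuredPercentage int) int {
	// Small clusters, or an explicit 100%, are always fully checked.
	if numAllNodes < minNodesToFind || configuredPercentage >= 100 {
		return numAllNodes
	}

	percentage := configuredPercentage
	if percentage <= 0 {
		// Assumed linear formula: ~50% for 100 nodes, 10% for 5000 nodes.
		percentage = defaultPercentage - numAllNodes/125
		if percentage < minPercentage {
			percentage = minPercentage
		}
	}

	numNodes := numAllNodes * percentage / 100
	if numNodes < minNodesToFind {
		return minNodesToFind
	}
	return numNodes
}

func main() {
	for _, size := range []int{100, 1000, 5000} {
		fmt.Printf("%d nodes -> stop after %d feasible nodes\n",
			size, numFeasibleNodesToFind(size, 0))
	}
}
```

Running this prints 50, 420, and 500 for clusters of 100, 1000, and 5000 nodes respectively; in other words, under these assumptions the automatic value is roughly max(5, 50 − numAllNodes/125) percent.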