Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add nodegroup plugin #3132

Merged
merged 3 commits into from
Jan 17, 2024
Merged

Conversation

wuyueandrew
Copy link
Contributor

#3131 add nodegroup plugin with ut

@volcano-sh-bot
Copy link
Contributor

Welcome @wuyueandrew!

It looks like this is your first PR to volcano-sh/volcano 馃帀.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 20, 2023
@wuyueandrew
Copy link
Contributor Author

/assign @Thor-wl

@william-wang
Copy link
Member

Thanks for your contribution and testing. We will review it as soon as possible.

@wuyueandrew
Copy link
Contributor Author

Thanks for your contribution and testing. We will review it as soon as possible.

lint err fixed plz restart ci pipeline

@Monokaix
Copy link
Member

Monokaix commented Oct 7, 2023

Thanks for your contribution! Please add a user-guide for this plugin refer to #3149.

@wuyueandrew
Copy link
Contributor Author

Thanks for your contribution! Please add a user-guide for this plugin refer to #3149.
Doc of node-group was added before, link:
https://github.com/volcano-sh/volcano/blob/master/docs/design/node-group.md

@Monokaix
Copy link
Member

Hi, please take a look at review and squash your commits : )

@@ -0,0 +1,229 @@
/*
Copyright 2019 The Kubernetes Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the copyright to be the right one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,229 @@
/*
Copyright 2019 The Kubernetes Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change 2019 to 2023.

// - name: predicates
// - name: proportion
// - name: nodegroup
// enablePredicate: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This arg enablePredicate seems not used but exists in example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will remove this argument

queueGroupAffinityPreferred: make(map[string][]string, 0),
}
}
func (q queueGroupAffinity) predicate(queue, group string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If node has no nodeGroupLabel, predicate will not fit, is this expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider of our scenery, nodeGroupLabel is strict required. if there is a node without nodeGroupLabel, pod cannot assigned to this node.

}
if q.queueGroupAffinityPreferred != nil {
if groups, ok := q.queueGroupAffinityPreferred[queue]; ok {
if !contains(groups, group) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here should be contains.

if q.queueGroupAffinityPreferred != nil {
if groups, ok := q.queueGroupAffinityPreferred[queue]; ok {
if !contains(groups, group) {
return 1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AntiAffinity and Affinity should be both considered when scoring, and compute a weighted value finally.

return &nodeGroupPlugin{pluginArguments: arguments}
}

func (pp *nodeGroupPlugin) Name() string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better rename this to np or something to be consistent with other receiver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np is better

"volcano.sh/volcano/pkg/scheduler/util"
)

func TestScore(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ut tests score function, but it seems only predicate is tested because there is only RequiredDuringSchedulingIgnoredDuringExecution item in queue.spec. Please test both score and predict function.

@william-wang
Copy link
Member

@wuyueandrew Please handle the confliction in the PR and the ci failure:)

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 8, 2023
@wuyueandrew wuyueandrew force-pushed the nodegroup-issue3131 branch 2 times, most recently from b8c5755 to 7635571 Compare November 8, 2023 06:29
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 8, 2023
queueGroupAffinity := NewQueueGroupAffinity()
for _, queue := range ssn.Queues {
affinity := queue.Queue.Spec.Affinity
if nil == affinity {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To uniform code style, please move nil to front.

}

type queueGroupAffinity struct {
queueGroupAntiAffinityRequired map[string][]string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value is a array type is not so good, if user mis-config a lots of duplicated label, it's unacceptable, and sets is better, and it can de-duplicate and is more efficient.

if q.queueGroupAffinityRequired != nil {
if groups, ok := q.queueGroupAffinityRequired[queue]; ok {
if contains(groups, group) {
flag = flag || true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|| is necessary here, and preferred items should not be considered in predict.

}
}

if q.queueGroupAntiAffinityRequired != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AntiAffinityRequired should be filtered in predict, so it should not exists in score

if q.queueGroupAntiAffinityPreferred != nil {
if groups, ok := q.queueGroupAntiAffinityPreferred[queue]; ok {
if contains(groups, group) {
nodeScore = 0.25
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value should be -1


func NewQueueGroupAffinity() queueGroupAffinity {
return queueGroupAffinity{
queueGroupAntiAffinityRequired: make(map[string]sets.Set[string], 0),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 can be omitted.

if q.queueGroupAffinityPreferred != nil {
if groups, ok := q.queueGroupAffinityPreferred[queue]; ok {
if groups.Has(group) {
nodeScore = 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set a base constant value 100.

}
}
if q.queueGroupAffinityPreferred != nil {
if groups, ok := q.queueGroupAffinityPreferred[queue]; ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer node get higher score than required?

@@ -39,7 +39,7 @@ case2: recommend queue can use private cloud nodes or public cloud nodes, but tt

affinity configure:
1. affinity.nodeGroupAffinity.requiredDuringSchedulingIgnoredDuringExecution, hard constraints, such as `nlp = nodegroup1,nodegroup2`, it means that task in queue=nlp can ony run on the nodes in nodegroup1 or nodegroup2.
2. affinity.nodeGroupAffinity.preferredDuringSchedulingIgnoredDuringExecution, soft constraints, such as `nlp = nodegroup1`, it means that task in queue=nlp runs on nodegroup1 first, but if the resources of nodegroup1 is insufficient, it can also run on other nodegroups.
2. affinity.nodeGroupAffinity.preferredDuringSchedulingIgnoredDuringExecution, soft constraints, such as `nlp = nodegroup1`, it means that task in queue=nlp runs on other nodegroups first, but if the resources is insufficient, it can also run on nodegroup1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other nodegroups is misunderstanding, other nodesgroups specified in affinity.nodeGroupAffinity.requiredDuringSchedulingIgnoredDuringExecution is better.


## Introduction

**Nodegroup plugin** is designed to isolate resources by assigning labels to nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also add: and set node label affinty on Queue.

preferredDuringSchedulingIgnoredDuringExecution:
- groupname3
```
This implies that tasks in the "default" queue will be executed on "groupname1" and "groupname2", with a preference for "groupname1" to run first. Tasks are restricted from running on "groupname3" and "groupname4". However, if the resources in other node groups are insufficient, the task can run on "nodegroup3".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executed -> scheduled


The nodegroup design document provides the most detailed information about the node group. There are some tips to help avoid certain issues.

1. Soft constraints are a subset of hard constraints, including both affinity and anti-affinity. Consider a queue setup as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following steps are both about queue's configuration, should also add other steps, like set node lable, and submit jobs to a queue and verfiy that jobs are scheduled to affinity nodes and schedule failed on antiaffinity nodes.

}

status, _ := ssn.PredicateFn(task, node)
// if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented codes.

"volcano.sh/volcano/pkg/scheduler/util"
)

func TestNormal(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace it with a more descriptive name.

expectedStatus map[string]map[string]int
}{
{
name: "normal queue test",
Copy link
Member

@Monokaix Monokaix Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same above.

},
},
{
name: "special queue job",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same above.

@wuyueandrew wuyueandrew force-pushed the nodegroup-issue3131 branch 4 times, most recently from 988b0c5 to a1e11b4 Compare December 27, 2023 06:36

### validate queue affinity and antiAffinity rules is effected

Query pod information and verify whether the pod has been assigned to the correct node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please illustrate which node is correct and which node is filtered.

group := node.Node.Labels[NodeGroupNameKey]
queue := GetPodQueue(task)
score := queueGroupAffinity.score(queue, group)
klog.V(3).Infof("task <%s>/<%s> queue %s on node %s of nodegroup %s, score %v", task.Namespace, task.Name, queue, node.Name, group, score)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

klog.v(4)

@wuyueandrew wuyueandrew force-pushed the nodegroup-issue3131 branch 2 times, most recently from ec37291 to 89c1cbb Compare December 28, 2023 02:41
Signed-off-by: wuyue <wuyue_andrew@outlook.com>
Signed-off-by: wuyue <wuyue_andrew@outlook.com>
how to use nodegroup doc

how to use nodegroup doc

Signed-off-by: wuyue <wuyue_andrew@outlook.com>

how to use nodegroup doc

how to use nodegroup doc

Signed-off-by: wuyue <wuyue_andrew@outlook.com>

how to use nodegroup doc

Signed-off-by: wuyue <wuyue_andrew@outlook.com>

how to use nodegroup doc

Signed-off-by: wuyue <wuyue_andrew@outlook.com>
Copy link
Member

@william-wang william-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2024
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 17, 2024
@volcano-sh-bot volcano-sh-bot merged commit 2110527 into volcano-sh:master Jan 17, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants