Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: Add dynamic resource allocation(DRA) plugin #3799

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

JesseStutler
Copy link
Member

@JesseStutler JesseStutler commented Oct 31, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Background

Inspired by #3798 , introduce DRA into volcano. It solves the following things:

See docs/design/dynamicresources-plugin.md for more details.

Verfication

I followed up https://github.com/kubernetes-sigs/dra-example-driver to deploy an example resource driver and deploy the demo yaml gpu-test{1,2,3,4,5}.yaml, the pod can be successfully scheduled:
image
GPU information printed in containers:
image
And allocation result in ResourceClaim is successfully updated:
image

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign kevin-wangzefeng
You can assign the PR to them by writing /assign @kevin-wangzefeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 31, 2024
@googs1025
Copy link
Member

/cc

@@ -25,6 +25,7 @@ custom:
leader_elect_enable: false
enabled_admissions: "/jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate"
colocation_enable: false
feature_gates: DynamicResourceAllocation=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to enable this feature-gate by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think? Do we need to enable it by default? I think that if users need to enable dra plugin, they also need to reconfigure the scheduler plugins configmap to enable it, it also looks like featuregate. If we don't enable this feture-gate, users need to do reconfiguring the scheduler plugins configmap and enabling the feature-gate two steps.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK we need not to enable this feature-gate by default, I will add some informers in schedulerCache, some users may need to use k8s below v1.31, if there are no APIs like DeviceClass, ResourceClaim, etc registered, it will cause some errors.

@@ -0,0 +1,808 @@
/*
Copyright 2022 The Kubernetes Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part directly imported into the scheduler code? Or has it been modified and adapted (I don't know much about it)? If not, should it be marked in the code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just changed the New to NewDRAPlugin and deleted EventsToRegister and some related codes with it, I think volcano doesn't need them right now. These codes are copied from the kubernetes repo, so the license needs to keep.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand, if so, if the upstream code changes, do I need to make the same changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, before the dynamicResources struct changed to DynamicResources, we can only copy the whole codes related into volcano, otherwise we can't directly call those extension points. It may cause us to miss tracking some repaired codes, but if we can access the DynamicResources struct in v1.32, the problem will be solved.

@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 31, 2024
@JesseStutler JesseStutler force-pushed the dra_dev branch 2 times, most recently from de77070 to b120b42 Compare November 1, 2024 09:37
@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 16, 2024
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 11, 2024
Signed-off-by: jessestutler <chenzicong4@huawei.com>
Signed-off-by: JesseStutler <chenzicong4@huawei.com>
Signed-off-by: jessestutler <chenzicong4@huawei.com>
Signed-off-by: jessestutler <chenzicong4@huawei.com>
@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants