
[SDK] Show available Runtime accelerators to users #2355

Open
andreyvelich opened this issue Dec 13, 2024 · 1 comment
@andreyvelich
Member

What would you like to be added?

As part of the initial Kubeflow Training V2 SDK implementation, we introduced a label that identifies which accelerators are used by a Training Runtime: #2324 (comment).

If a Training Runtime has the training.kubeflow.org/accelerator: GPU-Tesla-V100-16GB label, we add this value to the Runtime class. Additionally, we get the number of CPU, GPU, or TPU devices from the TrainJob's containers and insert this value into the TrainJob's Component class.
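
For illustration only, here is a minimal sketch of that flow. The `Runtime` and `Component` dataclasses, their field names, and the device resource keys below are assumptions for this example, not the actual SDK code:

```python
# Illustrative sketch (not the SDK implementation): surface the accelerator
# label and per-container device counts to users.
from dataclasses import dataclass
from typing import Optional

ACCELERATOR_LABEL = "training.kubeflow.org/accelerator"

# Kubernetes resource names used for device counting; adjust for your cluster.
# GPU/TPU keys are checked before "cpu" so accelerator containers are not
# reported as CPU-only.
DEVICE_RESOURCE_KEYS = ("nvidia.com/gpu", "google.com/tpu", "cpu")


@dataclass
class Runtime:
    name: str
    accelerator: Optional[str] = None  # e.g. "GPU-Tesla-V100-16GB"


@dataclass
class Component:
    name: str
    device: Optional[str] = None        # "gpu", "tpu", or "cpu"
    device_count: Optional[str] = None  # taken from container resource limits


def runtime_from_manifest(manifest: dict) -> Runtime:
    """Read the accelerator label from a (Cluster)TrainingRuntime manifest."""
    meta = manifest.get("metadata", {})
    return Runtime(
        name=meta.get("name", ""),
        accelerator=meta.get("labels", {}).get(ACCELERATOR_LABEL),
    )


def component_from_container(container: dict) -> Component:
    """Derive device type and count from a TrainJob container's resource limits."""
    limits = container.get("resources", {}).get("limits", {})
    for key in DEVICE_RESOURCE_KEYS:
        if key in limits:
            device = "gpu" if "gpu" in key else "tpu" if "tpu" in key else "cpu"
            return Component(container["name"], device, str(limits[key]))
    return Component(container["name"])
```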

However, this approach conflicts with other Kubernetes primitives (e.g. nodeSelectors, tolerations) and Kueue configurations such as resource flavors.

We should discuss the right way to surface the available hardware resources to users when they are using Training Runtimes.

cc @franciscojavierarceo @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @seanlaii @kannon92

Why is this needed?

Data Scientists and ML Engineers should understand which accelerators are available to them while using the Training Runtimes.

In the future, we could potentially use these values to automatically assign model and data tensors to the appropriate hardware devices while using the Kubeflow Training SDK.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

@kannon92
Contributor

At a quick glance, I think you may need K8s primitives (nodeSelectors, tolerations) and Kueue resources like resource flavors/ClusterQueues.

If you want to make the SDK data-science friendly, I tend to think that you want to specify Training Runtimes and that "glue" could set the K8s primitives. I also think you may need to include things like images there too.

I see it as: a user would say "I want an AMD GPU", and the admin has a series of values that will get added to their PodSpec to help achieve that.
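
For concreteness, a hypothetical sketch of what that admin-defined "glue" could look like: a mapping from a user-facing accelerator name to the PodSpec fields the SDK would merge in. Every profile name, label, toleration, and image below is an example, not a recommendation:

```python
# Hypothetical admin-maintained mapping from accelerator name to the Kubernetes
# primitives merged into the TrainJob pod spec. All keys/values are examples.
ACCELERATOR_PROFILES = {
    "amd-gpu": {
        "image": "rocm/pytorch:latest",
        "nodeSelector": {"gpu.vendor": "amd"},
        "tolerations": [
            {"key": "amd.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "limits": {"amd.com/gpu": "1"},
    },
    "nvidia-v100": {
        "image": "pytorch/pytorch:latest",
        "nodeSelector": {"gpu.product": "Tesla-V100-16GB"},
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "limits": {"nvidia.com/gpu": "1"},
    },
}


def apply_accelerator_profile(pod_spec: dict, accelerator: str) -> dict:
    """Merge the admin-defined profile for `accelerator` into a pod spec dict."""
    profile = ACCELERATOR_PROFILES[accelerator]
    pod_spec.setdefault("nodeSelector", {}).update(profile["nodeSelector"])
    pod_spec.setdefault("tolerations", []).extend(profile["tolerations"])
    for container in pod_spec.get("containers", []):
        container["image"] = profile["image"]
        container.setdefault("resources", {}).setdefault("limits", {}).update(
            profile["limits"]
        )
    return pod_spec
```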
